The explainable, auditable prompt optimizer
Rules you can read. Gains you can trust. No fine-tuning.
Overview
Constitutional AI steers LLM behavior through natural-language rules, but writing those rules by hand is hard, and getting them right is harder. Existing prompt optimizers try to automate this but fall short: they need many labeled examples, produce opaque prompt edits, and hit diminishing returns as prompts grow.
MAC fixes this. It optimizes over structured sets of rules using a network of specialized agents that propose, edit, and validate rule updates against a held-out set. The result is a human-readable constitution with no fine-tuning and no gradient updates.
Real rules MAC learned for PII tagging
Legal: "Mark as private specific dates when they appear in the context of personal events or actions, such as births, deaths, or significant life events. Do not mark general references or narrative text." Examples: mark "1975", "22 August 2003"; do not mark "on a day in June".
Healthcare: "Mark terms such as heart failure subtypes (e.g., diastolic heart failure, systolic heart failure) when explicitly mentioned in a patient's medical history as private. Do not mark generic medical conditions without an explicit subtype."
Finance: "Mark as private any phrase indicating a specific financial timeframe (e.g., FY2022, YTD FY2021) when it appears in direct association with identifiable information. Do not mark standalone labels without specific identifiers."
- Explainable - every rule is natural language you can read, audit, and hand-edit
- Structured - optimizes over a set of rules, not a monolithic prompt blob
- Auditable - each rule change is validated on a held-out set before acceptance
- Transferable - rules learned on one model work on another without retraining
- Sample-efficient - converges with far fewer labeled examples than GEPA or MIPRO
Four agents coordinate in a loop: Annotator (applies rules) → Decision Agent (analyzes errors) → Rule Proposer (drafts new rules) / Rule Editor (refines existing ones). A meta-model rewrites all agent prompts so MAC generalizes to any downstream task. Just provide examples and a metric.
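The loop above can be sketched in a few lines. This is an illustrative outline only, not the real MAC internals; the function and field names (`mac_round`, `needs_new_rule`, and the callables passed in) are assumptions made for the sketch:

```python
def mac_round(constitution, batch, annotate, decide, propose, edit, validate):
    """One illustrative optimization round: annotate, analyze errors,
    draft or refine rules, and accept only validated changes."""
    predictions = annotate(constitution, batch)        # Annotator applies rules
    for err in decide(constitution, predictions):      # Decision Agent analyzes errors
        # Rule Proposer drafts a new rule; Rule Editor refines an existing one
        candidate = propose(err) if err["needs_new_rule"] else edit(err)
        if validate(constitution, candidate):          # held-out check before acceptance
            constitution = constitution + [candidate]
    return constitution
```

In the real system each callable is an LLM agent; the key property shown here is that every rule change is gated by held-out validation before it enters the constitution.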
MAC vs Prompt Optimizers
Domain-specific PII tagging across Legal, Finance, and Healthcare documents using Qwen2.5-Instruct models at 3B, 7B, and 14B scales.
| Dataset | Method | 3B | 7B | 14B |
|---|---|---|---|---|
| Legal | GEPA | 12.7 | 52.1 | 50.1 |
| Legal | MIPRO | 13.2 | 38.6 | 44.3 |
| Legal | MAC | 36.0 | 55.1 | 67.3 |
| Finance | GEPA | 11.9 | 22.5 | 28.8 |
| Finance | MIPRO | 9.8 | 22.3 | 26.8 |
| Finance | MAC | 30.1 | 37.5 | 45.5 |
| Healthcare | GEPA | 16.5 | 12.9 | 16.8 |
| Healthcare | MIPRO | 12.5 | 16.8 | 20.6 |
| Healthcare | MAC | 9.7 | 20.1 | 26.7 |
MAC wins 8 of 9 configurations. Largest gain: Legal at 3B, +174% over the next-best optimizer.
MAC vs Pretrained Taggers (14B)
| Domain | MAC | Best Baseline | Gain |
|---|---|---|---|
| Legal | 67.3 | 57.3 (Presidio) | +17% |
| Finance | 45.5 | 44.7 (Presidio) | +2% |
| Healthcare | 26.7 | 12.0 (GLiNER) | +123% |
Training Dynamics
Validation F1 over training batches on 3B models (ECHR dataset). MAC steadily improves while baselines plateau or fluctuate.
General-Purpose Benchmarks
MAC is task-agnostic. A meta-model rewrites all agent prompts to fit any downstream task. Below are results on three diverse benchmarks.
HoVer (Fact Verification)
| Setup | Worker | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|
| API + Cloud MAC | gpt-4o-mini | adapt | 25% | 88% | +63% |
| Local + Cloud MAC | Qwen3-8B | custom | 62% | 88% | +26% |
| Local + Cloud MAC | Qwen3-8B | adapt | 69% | 81% | +12% |
| Fully Local | Qwen3-8B | custom | 75% | 81% | +6% |
| API + Cloud MAC | gpt-4o-mini | custom | 88% | 88% | 0% |
| Fully Local | Qwen3-8B | adapt | 75% | 75% | 0% |
GSM8K (Math Reasoning)
| Setup | Worker | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|
| Local + Cloud MAC | Qwen3-8B | adapt | 94% | 100% | +6% |
| Local + Cloud MAC | Qwen3-8B | custom | 94% | 100% | +6% |
| Fully Local | Qwen3-8B | custom | 94% | 100% | +6% |
| API + Cloud MAC | gpt-4o-mini | adapt | 100% | 100% | 0% |
| API + Cloud MAC | gpt-4o-mini | custom | 94% | 94% | 0% |
| Fully Local | Qwen3-8B | adapt | 100% | 100% | 0% |
HotpotQA (Multi-Hop QA)
| Setup | Worker | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|
| Fully Local | Qwen3-8B | custom | 22% | 36% | +14% |
| Local + Cloud MAC | Qwen3-8B | adapt | 29% | 38% | +9% |
| API + Cloud MAC | gpt-4o-mini | adapt | 25% | 34% | +9% |
| Local + Cloud MAC | Qwen3-8B | custom | 29% | 36% | +7% |
| API + Cloud MAC | gpt-4o-mini | custom | 26% | 32% | +6% |
| Fully Local | Qwen3-8B | adapt | 27% | 27% | 0% |
Any Task, Any Model
MAC is not tied to a single domain. Give it a task_description, a few examples, and a scoring function, and it learns a constitution for classification, extraction, math, QA, tool calling, or anything else you can evaluate. A meta-model automatically rewrites every agent prompt to fit the new task, so you never write task-specific plumbing.
Under the hood, MAC uses a three-tier model setup: a cheap worker does annotation, strong MAC agents learn the rules, and an optional adapt model rewrites prompts:
```python
compiler = MAC(
    model="Qwen/Qwen3-8B",                # Tier 1: worker (cheap / local)
    base_url="http://localhost:8000/v1",
    mac_model="gpt-5.2",                  # Tier 2: MAC agents (strong)
    # adapt_model="gpt-5.2",              # Tier 3: defaults to mac_model
    task_description="Solve AIME competition math problems.",
    rule_type="math reasoning rules",
)
```
See Model Configuration for the full fallback cascade and provider examples.
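Beyond the constructor, MAC needs examples and a scoring function. A metric is just a plain callable; the one below is a minimal exact-match sketch, and the commented-out `compile` call is an assumed calling shape, not the confirmed MAC API:

```python
def exact_match(prediction: str, gold: str) -> float:
    """Score one example: 1.0 on an exact answer match, else 0.0."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0

# A few labeled examples in a simple dict format (illustrative schema)
examples = [
    {"question": "What is 7 * 8?", "answer": "56"},
    {"question": "What is 12 + 30?", "answer": "42"},
]

# Hypothetical call shape -- see the MAC docs for the real entry point:
# constitution = compiler.compile(examples=examples, metric=exact_match)
```

Any metric that maps a prediction and a gold label to a score works, which is what lets MAC optimize classification, extraction, math, QA, and tool calling with the same machinery.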
Citation
```bibtex
@article{thareja2025mac,
  title={MAC: Multi-Agent Constitution Learning for Generalizable Text Annotation},
  author={Thareja, Rushil},
  year={2025}
}
```
License
MAC is released under a dual license:
- Non-commercial use (academic research, education, personal projects): free
- Commercial use: requires a separate license. Contact rushil.thareja@mbzuai.ac.ae