The explainable, auditable prompt optimizer
Rules you can read. Gains you can trust. No fine-tuning.
Overview
Constitutional AI steers LLM behavior through natural-language rules, but writing those rules by hand is hard, and getting them right is harder. Existing prompt optimizers try to automate this but fall short: they need many labeled examples, produce opaque prompt edits, and hit diminishing returns as prompts grow.
MAC fixes this. It optimizes over structured sets of rules using a network of specialized agents that propose, edit, and validate rule updates against a held-out set. The result is a human-readable constitution with no fine-tuning and no gradient updates.
Real rules MAC learned for PII tagging
Legal: "Mark as private specific dates when they appear in the context of personal events or actions, such as births, deaths, or significant life events. Do not mark general references or narrative text." Examples: mark "1975", "22 August 2003"; do not mark "on a day in June".
Healthcare: "Mark terms such as heart failure subtypes (e.g., diastolic heart failure, systolic heart failure) when explicitly mentioned in a patient's medical history as private. Do not mark generic medical conditions without an explicit subtype."
Finance: "Mark as private any phrase indicating a specific financial timeframe (e.g., FY2022, YTD FY2021) when it appears in direct association with identifiable information. Do not mark standalone labels without specific identifiers."
- Explainable - every rule is natural language you can read, audit, and hand-edit
- Structured - optimizes over a set of rules, not a monolithic prompt blob
- Auditable - each rule change is validated on a held-out set before acceptance
- Transferable - rules learned on one model work on another without retraining
- Sample-efficient - converges with far fewer labeled examples than GEPA or MIPRO
Four agents coordinate in a loop: Annotator (applies rules) → Decision Agent (analyzes errors) → Rule Proposer (drafts new rules) / Rule Editor (refines existing ones). A meta-model rewrites all agent prompts so MAC generalizes to any downstream task. Just provide examples and a metric.
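The loop above can be sketched in a few lines. This is an illustrative outline only, not the real MAC internals; the function and field names (`mac_round`, `needs_new_rule`, and the callables passed in) are assumptions made for the sketch:

```python
def mac_round(constitution, batch, annotate, decide, propose, edit, validate):
    """One illustrative optimization round: annotate, analyze errors,
    draft or refine rules, and accept only validated changes."""
    predictions = annotate(constitution, batch)        # Annotator applies rules
    for err in decide(constitution, predictions):      # Decision Agent analyzes errors
        # Rule Proposer drafts a new rule; Rule Editor refines an existing one
        candidate = propose(err) if err["needs_new_rule"] else edit(err)
        if validate(constitution, candidate):          # held-out check before acceptance
            constitution = constitution + [candidate]
    return constitution
```

In the real system each callable is an LLM agent; the key property shown here is that every rule change is gated by held-out validation before it enters the constitution.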
MAC vs Prompt Optimizers
Domain-specific PII tagging across Legal, Finance, and Healthcare documents using Qwen2.5-Instruct models at 3B, 7B, and 14B scales.
| Dataset | Method | 3B | 7B | 14B |
|---|---|---|---|---|
| Legal | GEPA | 12.7 | 52.1 | 50.1 |
| Legal | MIPRO | 13.2 | 38.6 | 44.3 |
| Legal | MAC | 36.0 | 55.1 | 67.3 |
| Finance | GEPA | 11.9 | 22.5 | 28.8 |
| Finance | MIPRO | 9.8 | 22.3 | 26.8 |
| Finance | MAC | 30.1 | 37.5 | 45.5 |
| Healthcare | GEPA | 16.5 | 12.9 | 16.8 |
| Healthcare | MIPRO | 12.5 | 16.8 | 20.6 |
| Healthcare | MAC | 9.7 | 20.1 | 26.7 |
MAC wins 8 of 9 configurations. Largest gain: Legal at 3B, +174% over the next-best optimizer.
MAC vs Pretrained Taggers (14B)
| Domain | MAC | Best Baseline | Gain |
|---|---|---|---|
| Legal | 67.3 | 57.3 (Presidio) | +17% |
| Finance | 45.5 | 44.7 (Presidio) | +2% |
| Healthcare | 26.7 | 12.0 (GLiNER) | +123% |
Training Dynamics
Validation F1 over training batches on 3B models (ECHR dataset). MAC steadily improves while baselines plateau or fluctuate.
General-Purpose Benchmarks
MAC is task-agnostic. A meta-model rewrites all agent prompts to fit any downstream task. Below are results on three diverse benchmarks.
HoVer (Fact Verification)
| Setup | Worker | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|
| API + Cloud MAC | gpt-4o-mini | adapt | 25% | 88% | +63% |
| Local + Cloud MAC | Qwen3-8B | custom | 62% | 88% | +26% |
| Local + Cloud MAC | Qwen3-8B | adapt | 69% | 81% | +12% |
| Fully Local | Qwen3-8B | custom | 75% | 81% | +6% |
| API + Cloud MAC | gpt-4o-mini | custom | 88% | 88% | 0% |
| Fully Local | Qwen3-8B | adapt | 75% | 75% | 0% |
GSM8K (Math Reasoning)
| Setup | Worker | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|
| Local + Cloud MAC | Qwen3-8B | adapt | 94% | 100% | +6% |
| Local + Cloud MAC | Qwen3-8B | custom | 94% | 100% | +6% |
| Fully Local | Qwen3-8B | custom | 94% | 100% | +6% |
| API + Cloud MAC | gpt-4o-mini | adapt | 100% | 100% | 0% |
| API + Cloud MAC | gpt-4o-mini | custom | 94% | 94% | 0% |
| Fully Local | Qwen3-8B | adapt | 100% | 100% | 0% |
HotpotQA (Multi-Hop QA)
| Setup | Worker | Style | Baseline | Best | Delta |
|---|---|---|---|---|---|
| Fully Local | Qwen3-8B | custom | 22% | 36% | +14% |
| Local + Cloud MAC | Qwen3-8B | adapt | 29% | 38% | +9% |
| API + Cloud MAC | gpt-4o-mini | adapt | 25% | 34% | +9% |
| Local + Cloud MAC | Qwen3-8B | custom | 29% | 36% | +7% |
| API + Cloud MAC | gpt-4o-mini | custom | 26% | 32% | +6% |
| Fully Local | Qwen3-8B | adapt | 27% | 27% | 0% |
Any Task, Any Model
MAC is not tied to a single domain. Give it a task_description, a few examples, and a scoring function, and it learns a constitution for classification, extraction, math, QA, tool calling, or anything else you can evaluate. A meta-model automatically rewrites every agent prompt to fit the new task, so you never write task-specific plumbing.
Under the hood, MAC uses a three-tier model setup: a cheap worker does annotation, strong MAC agents learn the rules, and an optional adapt model rewrites prompts:
```python
compiler = MAC(
    model="Qwen/Qwen3-8B",                # Tier 1: worker (cheap / local)
    base_url="http://localhost:8000/v1",
    mac_model="gpt-5.2",                  # Tier 2: MAC agents (strong)
    # adapt_model="gpt-5.2",              # Tier 3: defaults to mac_model
    task_description="Solve AIME competition math problems.",
    rule_type="math reasoning rules",
)
```
See Model Configuration for the full fallback cascade and provider examples.
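Beyond the constructor, MAC needs examples and a scoring function. A metric is just a plain callable; the one below is a minimal exact-match sketch, and the commented-out `compile` call is an assumed calling shape, not the confirmed MAC API:

```python
def exact_match(prediction: str, gold: str) -> float:
    """Score one example: 1.0 on an exact answer match, else 0.0."""
    return 1.0 if prediction.strip() == gold.strip() else 0.0

# A few labeled examples in a simple dict format (illustrative schema)
examples = [
    {"question": "What is 7 * 8?", "answer": "56"},
    {"question": "What is 12 + 30?", "answer": "42"},
]

# Hypothetical call shape -- see the MAC docs for the real entry point:
# constitution = compiler.compile(examples=examples, metric=exact_match)
```

Any metric that maps a prediction and a gold label to a score works, which is what lets MAC optimize classification, extraction, math, QA, and tool calling with the same machinery.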
Citation
```bibtex
@article{thareja2025mac,
  title={MAC: Multi-Agent Constitution Learning for Generalizable Text Annotation},
  author={Thareja, Rushil},
  year={2025}
}
```
License
MAC is released under a dual license:
- Non-commercial use (academic research, education, personal projects): free
- Commercial use: requires a separate license. Contact rushil.thareja@mbzuai.ac.ae