Documentation

Full research write-up — hypothesis, methodology, results across 5 experimental phases, and key findings.

Project Overview

Intelligence Arbitrage is a systematic experiment testing whether small, locally-running language models (2–14B parameters), when optimized with DSPy's MIPROv2 prompt optimizer, can match the performance of frontier API models (GPT-4o, GPT-5.2).

The core idea: frontier models are priced uniformly across all tasks, but many production tasks don't require frontier intelligence. If optimization can close the gap on specific tasks, teams can route work to cheap local models and pay frontier prices only where they're genuinely necessary.

175 evaluation runs · 8 models tested · 9 task domains · 5 experiment phases

Hypothesis

DSPy's optimization pipeline (teacher bootstrapping + MIPROv2) can close the performance gap between local open-weight models (3B–14B parameters) and frontier API models (GPT-4o, GPT-5.2) on structured NLP tasks.

The "arbitrage" framing: if a $0/eval local model can match a $0.01+/eval API model on a specific task after optimization, you capture the accuracy without paying the inference cost. The larger the performance gap that optimization closes, the larger the arbitrage opportunity.
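
The value of the arbitrage is just volume times the per-eval price gap. A minimal sketch (the request volume and prices below are hypothetical, not figures from the study):

```python
def monthly_savings(requests_per_month: int, api_cost_per_eval: float,
                    local_cost_per_eval: float = 0.0) -> float:
    """Dollars saved per month by routing a task to a local model that
    matches the API model's quality after optimization."""
    return requests_per_month * (api_cost_per_eval - local_cost_per_eval)

# 1M requests/month at $0.01/eval routed to a $0/eval local model:
print(monthly_savings(1_000_000, 0.01))  # 10000.0
```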

Key Findings

Intelligence arbitrage is real. Optimized local models (3–14B, $0/eval) match or exceed GPT-5.2 baseline on 5 of 9 tasks: extraction (+40pp), QA (+8pp), synthesis (+7pp), RAG (+2pp), and code (tied).
  1. Optimization is a force multiplier for weaker models. qwen2.5:7b gained +14.2pp average (Phase 1, n=50), gpt-4o +7.9pp. The smaller the model, the more prompt optimization helps.
  2. 14 rank inversions. After optimization, local models beat API models in 14 of 36 possible matchups. The most dramatic: mistral classification (54pp swing past gpt-4o).
  3. MIPROv2's all-or-nothing demo selection is suboptimal. It always chose 0 or 5 demos, never intermediate. Phase 3 proved phi4 peaks at n=1–2 on most tasks. A granular strategy could recover 2–8pp.
  4. Two failure modes. Instruction damage (optimized instructions hurt: phi4 rag −8pp) and demo interference (too many examples: phi4 analysis inverted-U peaking at n=2). These are independent and can compound.
  5. Optimization response is task-specific. Tasks explain 2× more variance than models (ratio 0.50). Code always 0pp, math always big gains — regardless of model.
  6. Demo count optima are model-specific. On math, llama3.2 needs all 5 demos (monotonic), phi4 gets +20pp from demo #1 alone. Same task, different optimal strategy.
  7. Extra compute per inference doesn't help (Phase 6a). Best-of-3, Majority Class Confidence, and Iterative Refinement showed no statistically significant improvement over vanilla CoT (all McNemar's p > 0.05, n=200).
  8. Two confirmed significant findings (Bonferroni-corrected, n=200). Mistral + MIPROv2 on Agentic: +13.5pp (p<0.0001). GPT-5.2 + MIPROv2 on Synthesis: +12.0pp (p<0.0001). All other unified experiment results were within noise.
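
The significance tests behind findings 7 and 8 can be sketched in plain Python. This is an exact McNemar test on paired correct/incorrect outcomes plus a Bonferroni threshold check; the `18` in the usage line is an illustrative test count, not the study's actual number of comparisons:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant pairs:
    b = examples only system A got right, c = examples only system B got right.
    Under H0, each discordant pair is a fair coin flip."""
    n, k = b + c, min(b, c)
    # Two-sided p-value: double the smaller binomial tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni_significant(p: float, n_tests: int, alpha: float = 0.05) -> bool:
    """Does p survive a Bonferroni correction for n_tests comparisons?"""
    return p < alpha / n_tests

print(mcnemar_exact_p(5, 5))                 # balanced disagreements -> 1.0
print(bonferroni_significant(0.0001, 18))    # survives correction -> True
```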

Optimization Lift by Model (Phase 1)

| Model | Params | Avg Baseline | Avg Optimized | Avg Delta | Tasks Improved |
|---|---|---|---|---|---|
| qwen2.5:7b | 7B | 51.7% | 65.9% | +14.2pp | 8/9 |
| mistral | 7B | 42.1% | 55.7% | +13.6pp | 6/9 |
| llama3.2 | 3B | 49.7% | 60.2% | +10.5pp | 7/9 |
| gpt-5.2 | API | 68.0% | 77.1% | +9.1pp | 5/9 |
| phi4 | 14B | 56.1% | 65.0% | +8.8pp | 5/9 |
| gpt-4o | API | 63.3% | 71.2% | +7.9pp | 5/9 |

Intelligence Arbitrage Scoreboard

Local optimized vs GPT-5.2 baseline:

| Task | GPT-5.2 Baseline | Best Local Optimized | Verdict |
|---|---|---|---|
| extraction | 44% | phi4: 84% (+40pp) | Massive |
| qa | 22% | phi4: 30% (+8pp) | Strong |
| synthesis | 40% | mistral: 47% (+7pp) | Strong |
| rag | 88% | llama3.2: 90% (+2pp) | Matched |
| code | 90% | phi4: 90% (tied) | Matched |
| agentic | 86% | mistral: 84% (−2pp) | Near-match |
| analysis | 88% | qwen2.5: 84% (−4pp) | Close |
| math | 90% | phi4: 84% (−6pp) | Gap remains |
| classification | 64% | mistral: 54% (−10pp) | Gap remains |

Experimental Setup

Task Domains

| Domain | Dataset | Metric |
|---|---|---|
| Classification | Banking77 | Accuracy |
| Math | GSM8K | Exact Match |
| QA | HotPotQA | Exact Match |
| Extraction | CoNLL-2003 | F1 Score |
| Analysis | BoolQ | Accuracy |
| Synthesis | XSum | F1 Score (w/ copy penalty) |
| Agentic | StrategyQA | Accuracy |
| RAG | Custom (SQuAD) | Faithfulness |
| Code | MBPP → HumanEval | Pass@1 (cross-dataset) |

Phase 1: Baseline vs Optimized (6 Models × 9 Tasks)

MIPROv2 optimization applied to 6 models across all 9 task domains. Each condition evaluated on n=50 examples.

Optimization Delta Matrix

Percentage-point change from optimization (positive = improved). Large positives (math, extraction, QA) represent genuine gains; deltas of ±4pp or smaller are within the noise floor at n=50.

| Task | llama3.2 | mistral | phi4 | qwen2.5:7b | gpt-4o | gpt-5.2 |
|---|---|---|---|---|---|---|
| agentic | +12 | +28 | −8 | +6 | +4 | −4 |
| analysis | +10 | +5 | −4 | +6 | +11 | 0 |
| classification | +3 | +44 | −2 | +20 | −10 | −2 |
| code | 0 | 0 | 0 | 0 | 0 | −2 |
| extraction | +16 | −2 | +32 | +42 | +28 | +50 |
| math | +26 | +30 | +36 | +36 | +28 | +8 |
| qa | +20 | +18 | +22 | +12 | +20 | +20 |
| rag | +8 | −3 | +2 | +6 | −2 | +4 |
| synthesis | −0.3 | +2.4 | +1.4 | +0.1 | −7.6 | +8.3 |
Statistical note: At n=50, standard error ≈ 7pp. Deltas of ±4pp or smaller are not statistically distinguishable from zero. Only large deltas (≥10pp) should be treated as directionally reliable.
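
The noise floor follows directly from the standard error of a proportion, sqrt(p(1−p)/n), which is largest at p = 0.5. A quick check at the two sample sizes used in the study:

```python
from math import sqrt

def se_proportion(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n examples."""
    return sqrt(p * (1 - p) / n)

# Worst case (p = 0.5):
print(round(se_proportion(0.5, 50), 3))   # 0.071 -> ~7pp at n=50
print(round(se_proportion(0.5, 200), 3))  # 0.035 -> ~3.5pp at n=200
```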

Rank Inversions

14 cases where an optimized local model outperformed an optimized API model. Most dramatic examples:

| Task | Local (optimized) | API (optimized) | Total swing from baseline |
|---|---|---|---|
| classification | mistral 54% | gpt-4o 46% | 54pp swing (was 46pp behind) |
| agentic | mistral 84% | gpt-5.2 82% | 32pp swing (was 30pp behind) |
| code | phi4 90% | gpt-4o 76% | 28pp swing (was already 14pp ahead) |
| extraction | phi4 84% | gpt-4o 74% | 14pp swing |
| synthesis | all 4 locals | gpt-4o 35.7% | gpt-4o regressed −7.6pp; all locals beat it |

Phase 2: Variance Analysis — Task-Specific or Model-Specific?

Result: tasks explain 2× more variance than models. The average within-task variance across models (102.5) is half the average within-model variance across tasks (205.2), giving a variance ratio of 0.50.

This means the task type predicts whether optimization will help more reliably than the model choice. Code always shows 0pp delta. Math always shows big gains. These patterns hold regardless of which model you use.
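
The variance ratio can be reproduced directly from the Phase 1 delta matrix. A sketch using population variance (`statistics.pvariance`), with the deltas transcribed from the table above:

```python
from statistics import pvariance

# Phase 1 optimization deltas in pp; columns are
# llama3.2, mistral, phi4, qwen2.5:7b, gpt-4o, gpt-5.2.
DELTAS = {
    "agentic":        [12, 28, -8, 6, 4, -4],
    "analysis":       [10, 5, -4, 6, 11, 0],
    "classification": [3, 44, -2, 20, -10, -2],
    "code":           [0, 0, 0, 0, 0, -2],
    "extraction":     [16, -2, 32, 42, 28, 50],
    "math":           [26, 30, 36, 36, 28, 8],
    "qa":             [20, 18, 22, 12, 20, 20],
    "rag":            [8, -3, 2, 6, -2, 4],
    "synthesis":      [-0.3, 2.4, 1.4, 0.1, -7.6, 8.3],
}

# Average variance across models within each task (how much the model matters)
within_task = sum(pvariance(row) for row in DELTAS.values()) / len(DELTAS)
# Average variance across tasks within each model (how much the task matters)
cols = list(zip(*DELTAS.values()))
within_model = sum(pvariance(col) for col in cols) / len(cols)

print(round(within_task, 1), round(within_model, 1),
      round(within_task / within_model, 2))  # 102.5 205.2 0.5
```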

Task Consistency Tiers

| Tier | Tasks | Cross-model std | Interpretation |
|---|---|---|---|
| Task-driven | code (0.7), qa (3.2), rag (4.3) | Low | All models respond similarly |
| Weakly task-driven | synthesis (4.8), analysis (5.0), math (9.8) | Medium | Some model-dependent variation |
| Model-dependent | classification (18.2), extraction (17.0), agentic (12.8) | High | Response depends heavily on the model |

MIPROv2 also showed an all-or-nothing demo count pattern: 30 programs used 5 demos, 24 used 0 demos. No intermediate counts were ever chosen — suggesting MIPROv2 uses a binary heuristic rather than fine-grained demo selection.

Phase 3: Demo Curve Experiment

For 6 (model, task) pairs where Phase 1 showed negative or near-zero deltas, we evaluated performance at n = 0, 1, 2, 3, 4, and 5 demonstrations, decomposing the optimization delta into an instruction effect and a demo effect.

Key Decomposition Results

| Model | Task | Instruction Effect | Demo Effect | Diagnosis |
|---|---|---|---|---|
| phi4 | analysis | −4pp (hurt) | −2pp (hurt) | Both hurt (double penalty) |
| phi4 | rag | −8pp (hurt) | +8pp (helped) | Instructions damaged, demos compensated |
| phi4 | math | +4pp (helped) | +30pp (helped) | Both helped |
| llama3.2 | math | +6pp (helped) | +26pp (helped) | Both helped, monotonic |
Two distinct failure modes identified:
(1) Instruction damage — optimized instructions confuse the model (phi4 rag: −8pp from instructions alone).
(2) Demo interference — too many examples degrade performance (phi4 analysis: peaks at n=2, falls to 76% at n=5).

Demo count optima are also model-specific: on math, llama3.2 shows monotonic improvement through all 5 demos, while phi4 gets +20pp from demo #1 alone and shows diminishing returns after that.
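
The decomposition itself is simple arithmetic: the instruction effect is the score change from the optimized instructions alone (at zero demos), and the demo effect is the further change from adding demos on top of those instructions. A sketch using the phi4/rag effects from the table above (the absolute scores here are illustrative, not the study's raw numbers):

```python
def decompose(baseline: float, instr_only: float,
              instr_plus_demos: float) -> tuple[float, float]:
    """Split a total optimization delta into (instruction_effect, demo_effect).
    instr_only = score with optimized instructions but zero demos."""
    instruction_effect = instr_only - baseline
    demo_effect = instr_plus_demos - instr_only
    return instruction_effect, demo_effect

# phi4 on rag: instructions -8pp, demos +8pp (net zero)
print(decompose(80.0, 72.0, 80.0))  # (-8.0, 8.0)
```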

Phase 4: Reasoning Model Baselines

Two reasoning models — qwen3:4b and deepseek-r1:7b — evaluated as unoptimized baselines across all 9 tasks.

Complementary Profiles

| Strength area | qwen3:4b | deepseek-r1:7b | Winner |
|---|---|---|---|
| Math | 50% | 32% | qwen3:4b (+18pp) |
| RAG | 88% | 74% | qwen3:4b (+14pp) |
| Classification | 14% | 64% | deepseek-r1:7b (+50pp) |
| Extraction | 20% | 66% | deepseek-r1:7b (+46pp) |
| Analysis | 68% | 82% | deepseek-r1:7b (+14pp) |

deepseek-r1:7b's classification score (64%) matches GPT-5.2's baseline without any optimization. However, its per-example latency is 13–18 seconds vs qwen3:4b's ~95ms — a critical production tradeoff.

Phase 5: Optimization Tipping Point

Question: at what number of labeled examples does optimization start working? Setup: BootstrapFewShotWithRandomSearch on qwen3:4b, with GPT-4o as the teacher.

| Task | Baseline | n=0 | n=1 | Result |
|---|---|---|---|---|
| math | 50% | 50% | 64% (+14pp) | Tipping point at n=1 |
| classification | 34% | 34% | 34% | Resists optimization entirely |
| qa | 12% | 12% | 10% | Optimization degraded (program failed) |

These three tasks represent the three possible outcomes of optimization: responsive (math jumps +14pp from a single example), resistant (classification: 3.7 hours, zero gain), and broken (QA program failed to compile at n=0 and n=1).

Methodology & Limitations

Optimization Protocol

Statistical Limitations

- Phase 1 noise floor (n=50): Standard error ≈ 7pp. Deltas smaller than ±4pp are statistically indistinguishable from zero. Only large deltas (≥10pp) should be treated as directionally reliable. The unified experiment (n=200) cuts the noise floor to ~3.5pp.
- DSPy version confound: Phase 1 results span DSPy 2.6.x (llama3.2, mistral) and DSPy 3.1.3 (phi4, qwen2.5, gpt-5.2). Baseline discrepancies of 2–4pp exist across versions and cannot be fully attributed to model differences alone.
- Synthesis metric discontinuity: Phase 1 synthesis scores (40–49%) and unified experiment scores (81.5–98%) use different evaluation metrics. Direct comparison across phases is not valid for this task.

Two Confirmed Significant Findings (Unified Experiment, n=200)

★ Mistral + MIPROv2, Agentic: +13.5pp (p<0.0001) — 53.0% → 66.5%. Largest confirmed gain in the study. CI: [5.3, 21.7].
★ GPT-5.2 + MIPROv2, Synthesis: +12.0pp (p<0.0001) — 81.5% → 93.5%. CI: [4.4, 19.6]. Note: uses different metric than Phase 1.

Technical Stack

| Component | Technology | Role |
|---|---|---|
| Optimization | DSPy (MIPROv2) | Prompt instruction + demo optimization |
| Local inference | Ollama | Llama 3.2, Qwen 2.5, Mistral, Phi-4, DeepSeek-R1, Qwen3 |
| API baselines | OpenAI API | GPT-4o, GPT-5.2 |
| Teacher model | GPT-4o | Generates optimized instructions and demonstrations |
| Datasets | HuggingFace Datasets | Banking77, GSM8K, HotPotQA, CoNLL-2003, BoolQ, XSum, StrategyQA, MBPP, HumanEval |
| Data analysis | Pandas | Result aggregation and analysis |
| Dashboard | Chart.js + React | Interactive visualization |

Optimization Loop

  1. Teacher Bootstrapping: GPT-4o generates high-quality few-shot examples for the target task
  2. MIPROv2: Optimizes task instructions and selects demonstrations for the student model
  3. Variance Reduction: 3 seeds (10, 20, 30); best-of-3 selected by dev set accuracy
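
A minimal sketch of this loop with DSPy's public API. The model ids, the `trainset`/`devset` variables, and the exact-match metric are placeholders (the study's real tasks use per-domain metrics), and details of `Evaluate`'s return type vary slightly across DSPy versions:

```python
import dspy
from dspy.teleprompt import MIPROv2

teacher = dspy.LM("openai/gpt-4o")  # bootstraps demos, proposes instructions
student = dspy.LM("ollama_chat/qwen2.5:7b", api_base="http://localhost:11434")
dspy.configure(lm=student)

program = dspy.ChainOfThought("question -> answer")
metric = lambda example, pred, trace=None: example.answer == pred.answer

best_score, best_program = -1.0, None
for seed in (10, 20, 30):  # variance reduction: best-of-3 seeds
    optimizer = MIPROv2(metric=metric, prompt_model=teacher,
                        task_model=student, auto="light", seed=seed)
    optimized = optimizer.compile(program, trainset=trainset)
    result = dspy.Evaluate(devset=devset, metric=metric)(optimized)
    score = getattr(result, "score", result)  # float or EvaluationResult
    if score > best_score:
        best_score, best_program = score, optimized
```

Running this requires a local Ollama server and an OpenAI API key; it is a shape-of-the-pipeline sketch, not the study's exact configuration.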

Interactive Dashboards

The experiment results are visualized across three interactive pages: