Can DSPy-optimized small models match frontier APIs? Results from 175 evaluation runs across 8 models, 9 tasks, and 5 experimental phases.
5/9
Tasks where local models match GPT-5.2
14
Rank inversions (local > API after optimization)
+44pp
Largest single optimization lift (mistral, classification)
$0
Cost per local model evaluation
Phase 1: Optimization Delta Heatmap
Percentage-point change from MIPROv2 optimization across 6 models and 9 tasks. Green = improvement, red = regression.
Optimization Lift by Model
Average accuracy gain from MIPROv2 optimization. Smaller models benefit disproportionately.
Intelligence Arbitrage Scoreboard
GPT-5.2 baseline vs best optimized local model. Tasks sorted by arbitrage opportunity.
Rank Inversions
Cases where a $0 local model beat a paid API model after optimization — despite starting behind.
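A rank inversion can be checked mechanically from baseline and optimized accuracy tables. A minimal sketch with placeholder scores: the +44pp lift and -10pp regression echo deltas reported elsewhere on this page, but the absolute accuracies are invented for illustration.

```python
def count_inversions(baseline, optimized, local, api_models):
    """List (task, local, api) triples where the local model trailed the
    API model at baseline but leads it after optimization."""
    inversions = []
    for task in baseline[local]:
        for api in api_models:
            if (baseline[local][task] < baseline[api][task]
                    and optimized[local][task] > optimized[api][task]):
                inversions.append((task, local, api))
    return inversions

# Placeholder accuracies (illustrative, not the study's actual numbers).
baseline = {"mistral": {"classification": 0.30},
            "gpt-4o":  {"classification": 0.74}}
optimized = {"mistral": {"classification": 0.74},   # +44pp lift
             "gpt-4o":  {"classification": 0.64}}   # -10pp regression
print(count_inversions(baseline, optimized, "mistral", ["gpt-4o"]))
# → [('classification', 'mistral', 'gpt-4o')]
```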
gpt-4o: The Optimization Victim
gpt-4o accounts for 3 of the study's 12 negative deltas, tied with phi4 and gpt-5.2 (3 each). Its regressions are the most severe in magnitude (-10pp classification, -7.6pp synthesis).
Phase 3: Demo Curve Experiment
How accuracy changes as we progressively add 0-5 few-shot demonstrations to MIPROv2 programs.
phi4 — Demo Interference
Analysis peaks at n=2, RAG at n=1. More demos actively hurt.
llama3.2 — Monotonic Benefit
Math improves with every demo. Synthesis already at ceiling.
Instruction Damage vs Demo Effect
Decomposing optimization into its two components: the effect of optimized instructions alone (n=0 vs baseline) and the added effect of demos (n=5 vs n=0).
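The decomposition described above is simple arithmetic on three accuracies per model-task cell. A sketch with illustrative numbers (an instruction-damage case where demos more than compensate):

```python
def decompose(baseline_acc, acc_n0, acc_n5):
    """Split a total optimization delta into the effect of optimized
    instructions alone (n=0 vs baseline) and the added effect of demos
    (n=5 vs n=0)."""
    instruction_effect = acc_n0 - baseline_acc
    demo_effect = acc_n5 - acc_n0
    total = acc_n5 - baseline_acc
    return instruction_effect, demo_effect, total

# Illustrative accuracies, not measured values.
inst, demo, total = decompose(baseline_acc=0.60, acc_n0=0.52, acc_n5=0.70)
print(f"instructions: {inst:+.2f}, demos: {demo:+.2f}, total: {total:+.2f}")
# instructions: -0.08, demos: +0.18, total: +0.10
```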
Phase 2: Task-Specific vs Model-Specific
Within-task variance is half of within-model variance (ratio 0.50): tasks determine WHETHER optimization helps, models determine HOW MUCH.
Cross-Model Standard Deviation by Task
Low std = all models agree. High std = model identity matters.
Gap Compression vs Divergence
Baseline range vs optimized range across models per task.
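The within-task vs within-model comparison can be recomputed from any model × task delta matrix. A stdlib-only sketch on a toy matrix (the deltas below are illustrative, not the study's):

```python
from statistics import mean, pstdev

# Toy delta matrix (model × task), in percentage points. Illustrative only.
deltas = {
    "phi4":     {"math": 20, "code": 0, "classification": 12},
    "mistral":  {"math": 24, "code": 0, "classification": 44},
    "llama3.2": {"math": 18, "code": 0, "classification": 10},
}
tasks = list(next(iter(deltas.values())))

# Within-task spread: for each task, std of deltas across models, averaged.
within_task = mean(pstdev(m[t] for m in deltas.values()) for t in tasks)
# Within-model spread: for each model, std of deltas across tasks, averaged.
within_model = mean(pstdev(m.values()) for m in deltas.values())

print(f"ratio = {within_task / within_model:.2f}")
```

A ratio below 1 means models agree more within a task than a single model agrees with itself across tasks, i.e. the task drives the outcome.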
Phase 4: Reasoning Model Baselines
Two reasoning models show complementary profiles — qwen3:4b excels at math/rag, deepseek-r1:7b at classification/extraction.
Phase 5: Optimization Tipping Point
How many labeled examples are needed before optimization fires? BootstrapFewShotWithRandomSearch on qwen3:4b, n=50 test examples, GPT-4o teacher.
qwen3:4b — Baseline vs n=0 vs n=1 Optimized
Math jumps +14pp from a single example. Classification and QA are flat at every count: classification optimized but gained nothing, and the QA program failed to compile at n=0 and n=1.
Key finding: A single labeled example is the tipping point for math (+14pp). Classification resisted optimization entirely (3.7 hrs, zero gain). QA program failed to compile at n=0 and n=1. These three tasks represent the three possible outcomes: responsive, resistant, and broken.
Phase 6a: Module Comparison (CoT vs Best-of-3 vs MCC)
Does 3× inference compute (Best-of-3 sampling, Majority Class Confidence, Iterative Refinement) beat vanilla CoT? phi4 and llama3.2, 4 tasks, n=200. McNemar's exact test.
phi4
CoT is competitive across all tasks. No significant differences (all p > 0.05).
llama3.2
Same null result. Refine-3 shows no systematic advantage over a single CoT pass.
Null result: Paying 3× the inference cost via multi-sample methods does not reliably improve accuracy over a single CoT call for these tasks and model sizes. The best module per task varies randomly — no method dominates consistently.
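The null result above rests on McNemar's exact test over paired predictions. A minimal stdlib implementation; the discordant-pair counts in the example are hypothetical, not the study's:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pairs:
    b = items method A got right and method B got wrong, c = the reverse.
    Under H0, b ~ Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)   # double the smaller tail, capped at 1

# Hypothetical: CoT and Best-of-3 disagree on 30 of 200 items, split 18/12.
print(round(mcnemar_exact(18, 12), 3))   # well above 0.05: a null result
```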
Phase 6b: GEPA vs MIPROv2
GEPA (Genetic-Pareto prompt evolution) vs MIPROv2 vs CoT baseline on phi4, 4 tasks, n=200. Bonferroni-corrected α = 0.0025.
phi4 — Three-Way Optimizer Comparison
One nominally significant result: GEPA on RAG (+5pp vs CoT baseline, p=0.006). Note that p=0.006 clears the conventional α = 0.05 but not the Bonferroni-corrected α = 0.0025; every other comparison is non-significant at either threshold.
One improvement at the uncorrected threshold: GEPA + phi4 on RAG: 83.5% → 88.5% (+5pp, p=0.006). This is the only case where GEPA outperformed CoT at conventional significance, though it does not survive Bonferroni correction. MIPROv2 gained +2.5pp on RAG (p=0.27, not significant). All agentic and analysis results were flat or within noise.
Bright bars = Bonferroni-significant (★ p<0.0001). Most conditions show no significant improvement or change within noise.
★ Mistral + MIPROv2, Agentic: +13.5pp (p<0.0001) — 53.0% → 66.5%. Largest confirmed gain in the study. CI: [5.3, 21.7].
★ GPT-5.2 + MIPROv2, Synthesis: +12.0pp (p<0.0001) — 81.5% → 93.5%. CI: [4.4, 19.6]. Note: synthesis uses a different evaluation metric than Phase 1 (see Statistical Notes below).
Unified Experiment — Delta Heatmap (n=200)
Percentage-point change from MIPROv2 optimization. ★ = Bonferroni-significant. Dashes indicate no change.
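The Bonferroni threshold is just the family α divided by the number of comparisons; α = 0.0025 is consistent with a 20-comparison family at a family α of 0.05 (an inference from the stated threshold, not a count given above):

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    # Each individual test must clear family_alpha / n_comparisons.
    return family_alpha / n_comparisons

alpha = bonferroni_alpha(0.05, 20)   # 0.0025

# 1e-5 stands in for the reported "p < 0.0001"; 0.006 is GEPA on RAG.
pvals = {"mistral+MIPROv2 agentic": 1e-5,
         "gpt-5.2+MIPROv2 synthesis": 1e-5,
         "GEPA+phi4 rag": 0.006}
passes = {k: p < alpha for k, p in pvals.items()}
print(alpha, passes)
```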
Statistical Notes
Phase 1 noise floor (n=50): with n=50 binary outcomes, the per-condition standard error is ≈7pp. Deltas smaller than ±4pp are statistically indistinguishable from zero, and only large Phase 1 deltas (≥10pp) should be treated as directionally reliable. The unified experiment (n=200) cuts the per-condition noise floor to ~3.5pp.
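The quoted noise-floor figures follow from the worst-case binomial standard error for an accuracy measured on n items:

```python
from math import sqrt

def se_pp(n, p=0.5):
    """Binomial standard error in percentage points; p=0.5 is the
    worst case for an accuracy estimated from n items."""
    return 100 * sqrt(p * (1 - p) / n)

print(round(se_pp(50), 1))    # 7.1  → the ≈7pp floor at n=50
print(round(se_pp(200), 1))   # 3.5  → the ~3.5pp floor at n=200
```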
DSPy version confound: Phase 1 baselines and llama3.2/mistral optimized programs used DSPy 2.6.x. phi4, qwen2.5, and gpt-5.2 optimized programs used DSPy 3.1.3. Baseline discrepancies of 2–4pp exist across versions and cannot be fully attributed to model differences alone.
Synthesis metric discontinuity: Phase 1 synthesis scores (40–49%) and unified experiment scores (81.5–98%) use different evaluation metrics. Direct comparison across phases is not valid for this task.
Key Findings
Intelligence arbitrage is real. Optimized local models (3-14B, $0/eval) match or exceed GPT-5.2 baseline on 5/9 tasks: extraction (+40pp), QA (+8pp), synthesis (+7pp), RAG (+2pp), and code (tied).
Optimization is a force multiplier for weaker models. qwen2.5:7b gained +14.2pp average (Phase 1, n=50), gpt-4o +7.9pp. The smaller the model, the more prompt optimization helps.
14 rank inversions. After optimization, local models beat API models in 14 of 36 possible matchups. The most dramatic: mistral classification (54pp swing past gpt-4o).
MIPROv2's all-or-nothing demo selection is suboptimal. It always chose 0 or 5 demos, never an intermediate count. Phase 3 showed phi4 peaks at n=1-2 on most tasks; a more granular selection strategy could recover 2-8pp.
Two failure modes. Instruction damage (optimized instructions hurt: phi4 rag -8pp) and demo interference (too many examples: phi4 analysis inverted-U peaking at n=2). These are independent and can compound.
Optimization response is task-specific. Tasks explain 2x more variance than models (ratio 0.50). Code always 0pp, math always big gains — regardless of model.
Demo count optima are model-specific. On math, llama3.2 needs all 5 demos (monotonic), phi4 gets +20pp from demo #1 alone. Same task, different optimal strategy.
Extra compute per inference doesn't help (Phase 6a). Best-of-3, Majority Class Confidence, and Iterative Refinement showed no statistically significant improvement over vanilla CoT on any tested task (all McNemar's p > 0.05, n=200).
Two confirmed significant findings (Bonferroni-corrected, n=200). Mistral + MIPROv2 on Agentic: +13.5pp (p<0.0001). GPT-5.2 + MIPROv2 on Synthesis: +12.0pp (p<0.0001). All other unified experiment results were within noise.