Can DSPy-optimized small models match frontier APIs? Results from 175 evaluation runs across 8 models, 9 tasks, and 5 experimental phases.
5/9
Tasks where local models match GPT-5.2
14
Rank inversions (local > API after optimization)
+44pp
Largest single optimization lift (mistral, classification)
$0
Cost per local model evaluation
Phase 1: Optimization Delta Heatmap
Percentage-point change from MIPROv2 optimization across 6 models and 9 tasks. Green = improvement, red = regression.
Optimization Lift by Model
Average accuracy gain from MIPROv2 optimization. Smaller models benefit disproportionately.
Intelligence Arbitrage Scoreboard
GPT-5.2 baseline vs best optimized local model. Tasks sorted by arbitrage opportunity.
Rank Inversions
Cases where a $0 local model beat a paid API model after optimization — despite starting behind.
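A rank inversion can be checked mechanically from baseline and optimized accuracy tables. A minimal sketch with placeholder scores: the +44pp lift and -10pp regression echo deltas reported elsewhere on this page, but the absolute accuracies are invented for illustration.

```python
def count_inversions(baseline, optimized, local, api_models):
    """List (task, local, api) triples where the local model trailed the
    API model at baseline but leads it after optimization."""
    inversions = []
    for task in baseline[local]:
        for api in api_models:
            if (baseline[local][task] < baseline[api][task]
                    and optimized[local][task] > optimized[api][task]):
                inversions.append((task, local, api))
    return inversions

# Placeholder accuracies (illustrative, not the study's actual numbers).
baseline = {"mistral": {"classification": 0.30},
            "gpt-4o":  {"classification": 0.74}}
optimized = {"mistral": {"classification": 0.74},   # +44pp lift
             "gpt-4o":  {"classification": 0.64}}   # -10pp regression
print(count_inversions(baseline, optimized, "mistral", ["gpt-4o"]))
# → [('classification', 'mistral', 'gpt-4o')]
```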
gpt-4o: The Optimization Victim
gpt-4o accounts for 3 of the study's 12 negative deltas, tied with phi4 and gpt-5.2 (3 each). Its regressions are the most severe in magnitude (-10pp classification, -7.6pp synthesis).
Phase 3: Demo Curve Experiment
How accuracy changes as we progressively add 0-5 few-shot demonstrations to MIPROv2 programs.
phi4 — Demo Interference
Analysis peaks at n=2, RAG at n=1. More demos actively hurt.
llama3.2 — Monotonic Benefit
Math improves with every demo. Synthesis already at ceiling.
Instruction Damage vs Demo Effect
Decomposing optimization into its two components: the effect of optimized instructions alone (n=0 vs baseline) and the added effect of demos (n=5 vs n=0).
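The decomposition described above is simple arithmetic on three accuracies per model-task cell. A sketch with illustrative numbers (an instruction-damage case where demos more than compensate):

```python
def decompose(baseline_acc, acc_n0, acc_n5):
    """Split a total optimization delta into the effect of optimized
    instructions alone (n=0 vs baseline) and the added effect of demos
    (n=5 vs n=0)."""
    instruction_effect = acc_n0 - baseline_acc
    demo_effect = acc_n5 - acc_n0
    total = acc_n5 - baseline_acc
    return instruction_effect, demo_effect, total

# Illustrative accuracies, not measured values.
inst, demo, total = decompose(baseline_acc=0.60, acc_n0=0.52, acc_n5=0.70)
print(f"instructions: {inst:+.2f}, demos: {demo:+.2f}, total: {total:+.2f}")
# instructions: -0.08, demos: +0.18, total: +0.10
```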
Phase 2: Task-Specific vs Model-Specific
Within-task variance is half of within-model variance (ratio 0.50): tasks determine WHETHER optimization helps, models determine HOW MUCH.
Cross-Model Standard Deviation by Task
Low std = all models agree. High std = model identity matters.
Gap Compression vs Divergence
Baseline range vs optimized range across models per task.
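The within-task vs within-model comparison can be recomputed from any model × task delta matrix. A stdlib-only sketch on a toy matrix (the deltas below are illustrative, not the study's):

```python
from statistics import mean, pstdev

# Toy delta matrix (model × task), in percentage points. Illustrative only.
deltas = {
    "phi4":     {"math": 20, "code": 0, "classification": 12},
    "mistral":  {"math": 24, "code": 0, "classification": 44},
    "llama3.2": {"math": 18, "code": 0, "classification": 10},
}
tasks = list(next(iter(deltas.values())))

# Within-task spread: for each task, std of deltas across models, averaged.
within_task = mean(pstdev(m[t] for m in deltas.values()) for t in tasks)
# Within-model spread: for each model, std of deltas across tasks, averaged.
within_model = mean(pstdev(m.values()) for m in deltas.values())

print(f"ratio = {within_task / within_model:.2f}")
```

A ratio below 1 means models agree more within a task than a single model agrees with itself across tasks, i.e. the task drives the outcome.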
Phase 4: Reasoning Model Baselines
Two reasoning models show complementary profiles — qwen3:4b excels at math/rag, deepseek-r1:7b at classification/extraction.
Phase 5: Optimization Tipping Point
How many labeled examples are needed before optimization fires? BootstrapFewShotWithRandomSearch on qwen3:4b, n=50 test examples, GPT-4o teacher.
qwen3:4b — Baseline vs n=0 vs n=1 Optimized
Math jumps +14pp from a single example. Classification and QA are flat at every count: classification optimized but gained nothing, and the QA program failed to compile at n=0 and n=1.
Key finding: A single labeled example is the tipping point for math (+14pp). Classification resisted optimization entirely (3.7 hrs, zero gain). QA program failed to compile at n=0 and n=1. These three tasks represent the three possible outcomes: responsive, resistant, and broken.
Phase 6a: Module Comparison (CoT vs Best-of-3 vs MCC)
Does 3× inference compute (Best-of-3 sampling, Majority Class Confidence, Iterative Refinement) beat vanilla CoT? phi4 and llama3.2, 4 tasks, n=200. McNemar's exact test.
phi4
CoT is competitive across all tasks. No significant differences (all p > 0.05).
llama3.2
Same null result. Refine-3 shows no systematic advantage over a single CoT pass.
Null result: Paying 3× the inference cost via multi-sample methods does not reliably improve accuracy over a single CoT call for these tasks and model sizes. The best module per task varies randomly — no method dominates consistently.
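The null result above rests on McNemar's exact test over paired predictions. A minimal stdlib implementation; the discordant-pair counts in the example are hypothetical, not the study's:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from the discordant pairs:
    b = items method A got right and method B got wrong, c = the reverse.
    Under H0, b ~ Binomial(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0
    tail = sum(comb(n, i) for i in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)   # double the smaller tail, capped at 1

# Hypothetical: CoT and Best-of-3 disagree on 30 of 200 items, split 18/12.
print(round(mcnemar_exact(18, 12), 3))   # well above 0.05: a null result
```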
Phase 6b: GEPA vs MIPROv2
GEPA (Genetic-Pareto prompt evolution) vs MIPROv2 vs CoT baseline on phi4, 4 tasks, n=200. Bonferroni-corrected α = 0.0025.
phi4 — Three-Way Optimizer Comparison
One nominally significant result: GEPA on RAG (+5pp vs CoT baseline, p=0.006). Note that p=0.006 clears the conventional α = 0.05 but not the Bonferroni-corrected α = 0.0025; every other comparison is non-significant at either threshold.
One improvement at the uncorrected threshold: GEPA + phi4 on RAG: 83.5% → 88.5% (+5pp, p=0.006). This is the only case where GEPA outperformed CoT at conventional significance, though it does not survive Bonferroni correction. MIPROv2 gained +2.5pp on RAG (p=0.27, not significant). All agentic and analysis results were flat or within noise.
Bright bars = Bonferroni-significant (★ p<0.0001). Most conditions show no significant improvement or change within noise.
★ Mistral + MIPROv2, Agentic: +13.5pp (p<0.0001) — 53.0% → 66.5%. Largest confirmed gain in the study. CI: [5.3, 21.7].
★ GPT-5.2 + MIPROv2, Synthesis: +12.0pp (p<0.0001) — 81.5% → 93.5%. CI: [4.4, 19.6]. Note: synthesis uses a different evaluation metric than Phase 1 (see Statistical Notes below).
Unified Experiment — Delta Heatmap (n=200)
Percentage-point change from MIPROv2 optimization. ★ = Bonferroni-significant. Dashes indicate no change.
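The Bonferroni threshold is just the family α divided by the number of comparisons; α = 0.0025 is consistent with a 20-comparison family at a family α of 0.05 (an inference from the stated threshold, not a count given above):

```python
def bonferroni_alpha(family_alpha, n_comparisons):
    # Each individual test must clear family_alpha / n_comparisons.
    return family_alpha / n_comparisons

alpha = bonferroni_alpha(0.05, 20)   # 0.0025

# 1e-5 stands in for the reported "p < 0.0001"; 0.006 is GEPA on RAG.
pvals = {"mistral+MIPROv2 agentic": 1e-5,
         "gpt-5.2+MIPROv2 synthesis": 1e-5,
         "GEPA+phi4 rag": 0.006}
passes = {k: p < alpha for k, p in pvals.items()}
print(alpha, passes)
```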
Statistical Notes
Phase 1 noise floor (n=50): with n=50 binary outcomes, the per-condition standard error is ≈7pp. Deltas smaller than ±4pp are statistically indistinguishable from zero, and only large Phase 1 deltas (≥10pp) should be treated as directionally reliable. The unified experiment (n=200) cuts the per-condition noise floor to ~3.5pp.
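The quoted noise-floor figures follow from the worst-case binomial standard error for an accuracy measured on n items:

```python
from math import sqrt

def se_pp(n, p=0.5):
    """Binomial standard error in percentage points; p=0.5 is the
    worst case for an accuracy estimated from n items."""
    return 100 * sqrt(p * (1 - p) / n)

print(round(se_pp(50), 1))    # 7.1  → the ≈7pp floor at n=50
print(round(se_pp(200), 1))   # 3.5  → the ~3.5pp floor at n=200
```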
DSPy version confound: Phase 1 baselines and llama3.2/mistral optimized programs used DSPy 2.6.x. phi4, qwen2.5, and gpt-5.2 optimized programs used DSPy 3.1.3. Baseline discrepancies of 2–4pp exist across versions and cannot be fully attributed to model differences alone.
Synthesis metric discontinuity: Phase 1 synthesis scores (40–49%) and unified experiment scores (81.5–98%) use different evaluation metrics. Direct comparison across phases is not valid for this task.
Key Findings
Intelligence arbitrage is real. Optimized local models (3-14B, $0/eval) match or exceed GPT-5.2 baseline on 5/9 tasks: extraction (+40pp), QA (+8pp), synthesis (+7pp), RAG (+2pp), and code (tied).
Optimization is a force multiplier for weaker models. qwen2.5:7b gained +14.2pp average (Phase 1, n=50), gpt-4o +7.9pp. The smaller the model, the more prompt optimization helps.
14 rank inversions. After optimization, local models beat API models in 14 of 36 possible matchups. The most dramatic: mistral classification (54pp swing past gpt-4o).
MIPROv2's all-or-nothing demo selection is suboptimal. It always chose 0 or 5 demos, never an intermediate count. Phase 3 showed phi4 peaks at n=1-2 on most tasks; a more granular selection strategy could recover 2-8pp.
Two failure modes. Instruction damage (optimized instructions hurt: phi4 rag -8pp) and demo interference (too many examples: phi4 analysis inverted-U peaking at n=2). These are independent and can compound.
Optimization response is task-specific. Tasks explain 2x more variance than models (ratio 0.50). Code always 0pp, math always big gains — regardless of model.
Demo count optima are model-specific. On math, llama3.2 needs all 5 demos (monotonic), phi4 gets +20pp from demo #1 alone. Same task, different optimal strategy.
Extra compute per inference doesn't help (Phase 6a). Best-of-3, Majority Class Confidence, and Iterative Refinement showed no statistically significant improvement over vanilla CoT on any tested task (all McNemar's p > 0.05, n=200).
Two confirmed significant findings (Bonferroni-corrected, n=200). Mistral + MIPROv2 on Agentic: +13.5pp (p<0.0001). GPT-5.2 + MIPROv2 on Synthesis: +12.0pp (p<0.0001). All other unified experiment results were within noise.