Documentation

Full research write-up — hypothesis, methodology, results across 5 experimental phases, and key findings.

Project Overview

Intelligence Arbitrage is a systematic experiment testing whether small, locally-running language models (2–14B parameters), when optimized with DSPy's MIPROv2 prompt optimizer, can match the performance of frontier API models (GPT-4o, GPT-5.2).

The core idea: frontier models are priced uniformly across all tasks, but many production tasks don't require frontier intelligence. If optimization can close the gap on specific tasks, teams can route work to cheap local models and pay frontier prices only where they're genuinely necessary.

175 evaluation runs · 8 models tested · 9 task domains · 5 experiment phases

Hypothesis

DSPy's optimization pipeline (teacher bootstrapping + MIPROv2) can close the performance gap between local open-weight models (3B–14B parameters) and frontier API models (GPT-4o, GPT-5.2) on structured NLP tasks.

The "arbitrage" framing: if a $0/eval local model can match a $0.01+/eval API model on a specific task after optimization, you capture the accuracy without paying the inference cost. The larger the performance gap that optimization closes, the larger the arbitrage opportunity.
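
The value of the arbitrage is just volume times the per-eval price gap. A minimal sketch (the request volume and prices below are hypothetical, not figures from the study):

```python
def monthly_savings(requests_per_month: int, api_cost_per_eval: float,
                    local_cost_per_eval: float = 0.0) -> float:
    """Dollars saved per month by routing a task to a local model that
    matches the API model's quality after optimization."""
    return requests_per_month * (api_cost_per_eval - local_cost_per_eval)

# 1M requests/month at $0.01/eval routed to a $0/eval local model:
print(monthly_savings(1_000_000, 0.01))  # 10000.0
```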

Key Findings

Intelligence arbitrage is real. Optimized local models (3–14B, $0/eval) match or exceed GPT-5.2 baseline on 5 of 9 tasks: extraction (+40pp), QA (+8pp), synthesis (+7pp), RAG (+2pp), and code (tied).
  1. Optimization is a force multiplier for weaker models. qwen2.5:7b gained +14.2pp average (Phase 1, n=50), gpt-4o +7.9pp. The smaller the model, the more prompt optimization helps.
  2. 14 rank inversions. After optimization, local models beat API models in 14 of 36 possible matchups. The most dramatic: mistral classification (54pp swing past gpt-4o).
  3. MIPROv2's all-or-nothing demo selection is suboptimal. It always chose 0 or 5 demos, never intermediate. Phase 3 proved phi4 peaks at n=1–2 on most tasks. A granular strategy could recover 2–8pp.
  4. Two failure modes. Instruction damage (optimized instructions hurt: phi4 rag −8pp) and demo interference (too many examples: phi4 analysis inverted-U peaking at n=2). These are independent and can compound.
  5. Optimization response is task-specific. Tasks explain 2× more variance than models (ratio 0.50). Code always 0pp, math always big gains — regardless of model.
  6. Demo count optima are model-specific. On math, llama3.2 needs all 5 demos (monotonic), phi4 gets +20pp from demo #1 alone. Same task, different optimal strategy.
  7. Extra compute per inference doesn't help (Phase 6a). Best-of-3, Majority Class Confidence, and Iterative Refinement showed no statistically significant improvement over vanilla CoT (all McNemar's p > 0.05, n=200).
  8. Two confirmed significant findings (Bonferroni-corrected, n=200). Mistral + MIPROv2 on Agentic: +13.5pp (p<0.0001). GPT-5.2 + MIPROv2 on Synthesis: +12.0pp (p<0.0001). All other unified experiment results were within noise.
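
The significance tests behind findings 7 and 8 can be sketched in plain Python. This is an exact McNemar test on paired correct/incorrect outcomes plus a Bonferroni threshold check; the `18` in the usage line is an illustrative test count, not the study's actual number of comparisons:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar test on the discordant pairs:
    b = examples only system A got right, c = examples only system B got right.
    Under H0, each discordant pair is a fair coin flip."""
    n, k = b + c, min(b, c)
    # Two-sided p-value: double the smaller binomial tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

def bonferroni_significant(p: float, n_tests: int, alpha: float = 0.05) -> bool:
    """Does p survive a Bonferroni correction for n_tests comparisons?"""
    return p < alpha / n_tests

print(mcnemar_exact_p(5, 5))                 # balanced disagreements -> 1.0
print(bonferroni_significant(0.0001, 18))    # survives correction -> True
```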

Optimization Lift by Model (Phase 1)

| Model | Params | Avg Baseline | Avg Optimized | Avg Delta | Tasks Improved |
|---|---|---|---|---|---|
| qwen2.5:7b | 7B | 51.7% | 65.9% | +14.2pp | 8/9 |
| mistral | 7B | 42.1% | 55.7% | +13.6pp | 6/9 |
| llama3.2 | 3B | 49.7% | 60.2% | +10.5pp | 7/9 |
| gpt-5.2 | API | 68.0% | 77.1% | +9.1pp | 5/9 |
| phi4 | 14B | 56.1% | 65.0% | +8.8pp | 5/9 |
| gpt-4o | API | 63.3% | 71.2% | +7.9pp | 5/9 |

Intelligence Arbitrage Scoreboard

Local optimized vs GPT-5.2 baseline:

| Task | GPT-5.2 Baseline | Best Local Optimized | Verdict |
|---|---|---|---|
| extraction | 44% | phi4: 84% (+40pp) | Massive |
| qa | 22% | phi4: 30% (+8pp) | Strong |
| synthesis | 40% | mistral: 47% (+7pp) | Strong |
| rag | 88% | llama3.2: 90% (+2pp) | Matched |
| code | 90% | phi4: 90% (tied) | Matched |
| agentic | 86% | mistral: 84% (−2pp) | Near-match |
| analysis | 88% | qwen2.5: 84% (−4pp) | Close |
| math | 90% | phi4: 84% (−6pp) | Gap remains |
| classification | 64% | mistral: 54% (−10pp) | Gap remains |

Experimental Setup

Task Domains

| Domain | Dataset | Metric |
|---|---|---|
| Classification | Banking77 | Accuracy |
| Math | GSM8K | Exact Match |
| QA | HotPotQA | Exact Match |
| Extraction | CoNLL-2003 | F1 Score |
| Analysis | BoolQ | Accuracy |
| Synthesis | XSum | F1 Score (w/ copy penalty) |
| Agentic | StrategyQA | Accuracy |
| RAG | Custom (SQuAD) | Faithfulness |
| Code | MBPP → HumanEval | Pass@1 (cross-dataset) |

Phase 1: Baseline vs Optimized (6 Models × 9 Tasks)

MIPROv2 optimization applied to 6 models across all 9 task domains. Each condition evaluated on n=50 examples.

Optimization Delta Matrix

Percentage-point change from optimization (positive = improved). Large positives (math, extraction, QA) represent genuine gains; deltas of ±4pp or smaller are within the noise floor at n=50.

| Task | llama3.2 | mistral | phi4 | qwen2.5:7b | gpt-4o | gpt-5.2 |
|---|---|---|---|---|---|---|
| agentic | +12 | +28 | −8 | +6 | +4 | −4 |
| analysis | +10 | +5 | −4 | +6 | +11 | 0 |
| classification | +3 | +44 | −2 | +20 | −10 | −2 |
| code | 0 | 0 | 0 | 0 | 0 | −2 |
| extraction | +16 | −2 | +32 | +42 | +28 | +50 |
| math | +26 | +30 | +36 | +36 | +28 | +8 |
| qa | +20 | +18 | +22 | +12 | +20 | +20 |
| rag | +8 | −3 | +2 | +6 | −2 | +4 |
| synthesis | −0.3 | +2.4 | +1.4 | +0.1 | −7.6 | +8.3 |
Statistical note: At n=50, standard error ≈ 7pp. Deltas of ±4pp or smaller are not statistically distinguishable from zero. Only large deltas (≥10pp) should be treated as directionally reliable.
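
The noise floor follows directly from the standard error of a proportion, sqrt(p(1−p)/n), which is largest at p = 0.5. A quick check at the two sample sizes used in the study:

```python
from math import sqrt

def se_proportion(p: float, n: int) -> float:
    """Standard error of an accuracy estimate from n examples."""
    return sqrt(p * (1 - p) / n)

# Worst case (p = 0.5):
print(round(se_proportion(0.5, 50), 3))   # 0.071 -> ~7pp at n=50
print(round(se_proportion(0.5, 200), 3))  # 0.035 -> ~3.5pp at n=200
```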

Rank Inversions

14 cases where an optimized local model outperformed an optimized API model. Most dramatic examples:

| Task | Local (optimized) | API (optimized) | Total swing from baseline |
|---|---|---|---|
| classification | mistral 54% | gpt-4o 46% | 54pp swing (was 46pp behind) |
| agentic | mistral 84% | gpt-5.2 82% | 32pp swing (was 30pp behind) |
| code | phi4 90% | gpt-4o 76% | 28pp swing (was already 14pp ahead) |
| extraction | phi4 84% | gpt-4o 74% | 14pp swing |
| synthesis | all 4 locals | gpt-4o 35.7% | gpt-4o regressed −7.6pp; all locals beat it |

Phase 2: Variance Analysis — Task-Specific or Model-Specific?

Result: tasks explain 2× more variance than models. The average within-task variance across models (102.5) is half the average within-model variance across tasks (205.2), giving a variance ratio of 0.50.

This means the task type predicts whether optimization will help more reliably than the model choice. Code always shows 0pp delta. Math always shows big gains. These patterns hold regardless of which model you use.
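
The variance ratio can be reproduced directly from the Phase 1 delta matrix. A sketch using population variance (`statistics.pvariance`), with the deltas transcribed from the table above:

```python
from statistics import pvariance

# Phase 1 optimization deltas in pp; columns are
# llama3.2, mistral, phi4, qwen2.5:7b, gpt-4o, gpt-5.2.
DELTAS = {
    "agentic":        [12, 28, -8, 6, 4, -4],
    "analysis":       [10, 5, -4, 6, 11, 0],
    "classification": [3, 44, -2, 20, -10, -2],
    "code":           [0, 0, 0, 0, 0, -2],
    "extraction":     [16, -2, 32, 42, 28, 50],
    "math":           [26, 30, 36, 36, 28, 8],
    "qa":             [20, 18, 22, 12, 20, 20],
    "rag":            [8, -3, 2, 6, -2, 4],
    "synthesis":      [-0.3, 2.4, 1.4, 0.1, -7.6, 8.3],
}

# Average variance across models within each task (how much the model matters)
within_task = sum(pvariance(row) for row in DELTAS.values()) / len(DELTAS)
# Average variance across tasks within each model (how much the task matters)
cols = list(zip(*DELTAS.values()))
within_model = sum(pvariance(col) for col in cols) / len(cols)

print(round(within_task, 1), round(within_model, 1),
      round(within_task / within_model, 2))  # 102.5 205.2 0.5
```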

Task Consistency Tiers

| Tier | Tasks | Cross-model std | Interpretation |
|---|---|---|---|
| Task-driven | code (0.7), qa (3.2), rag (4.3) | Low | All models respond similarly |
| Weakly task-driven | synthesis (4.8), analysis (5.0), math (9.8) | Medium | Some model-dependent variation |
| Model-dependent | classification (18.2), extraction (17.0), agentic (12.8) | High | Response depends heavily on the model |

MIPROv2 also showed an all-or-nothing demo count pattern: 30 programs used 5 demos, 24 used 0 demos. No intermediate counts were ever chosen — suggesting MIPROv2 uses a binary heuristic rather than fine-grained demo selection.

Phase 3: Demo Curve Experiment

For 6 (model, task) pairs where Phase 1 showed negative or near-zero deltas, we evaluated performance at n = 0, 1, 2, 3, 4, and 5 demonstrations, decomposing the optimization delta into an instruction effect and a demo effect.

Key Decomposition Results

| Model | Task | Instruction Effect | Demo Effect | Diagnosis |
|---|---|---|---|---|
| phi4 | analysis | −4pp (hurt) | −2pp (hurt) | Both hurt (double penalty) |
| phi4 | rag | −8pp (hurt) | +8pp (helped) | Instructions damaged, demos compensated |
| phi4 | math | +4pp (helped) | +30pp (helped) | Both helped |
| llama3.2 | math | +6pp (helped) | +26pp (helped) | Both helped, monotonic |
Two distinct failure modes identified:
(1) Instruction damage — optimized instructions confuse the model (phi4 rag: −8pp from instructions alone).
(2) Demo interference — too many examples degrade performance (phi4 analysis: peaks at n=2, falls to 76% at n=5).

Demo count optima are also model-specific: on math, llama3.2 shows monotonic improvement through all 5 demos, while phi4 gets +20pp from demo #1 alone and shows diminishing returns after that.
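
The decomposition itself is simple arithmetic: the instruction effect is the score change from the optimized instructions alone (at zero demos), and the demo effect is the further change from adding demos on top of those instructions. A sketch using the phi4/rag effects from the table above (the absolute scores here are illustrative, not the study's raw numbers):

```python
def decompose(baseline: float, instr_only: float,
              instr_plus_demos: float) -> tuple[float, float]:
    """Split a total optimization delta into (instruction_effect, demo_effect).
    instr_only = score with optimized instructions but zero demos."""
    instruction_effect = instr_only - baseline
    demo_effect = instr_plus_demos - instr_only
    return instruction_effect, demo_effect

# phi4 on rag: instructions -8pp, demos +8pp (net zero)
print(decompose(80.0, 72.0, 80.0))  # (-8.0, 8.0)
```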

Phase 4: Reasoning Model Baselines

Two reasoning models — qwen3:4b and deepseek-r1:7b — evaluated as unoptimized baselines across all 9 tasks.

Complementary Profiles

| Strength area | qwen3:4b | deepseek-r1:7b | Winner |
|---|---|---|---|
| Math | 50% | 32% | qwen3:4b (+18pp) |
| RAG | 88% | 74% | qwen3:4b (+14pp) |
| Classification | 14% | 64% | deepseek-r1:7b (+50pp) |
| Extraction | 20% | 66% | deepseek-r1:7b (+46pp) |
| Analysis | 68% | 82% | deepseek-r1:7b (+14pp) |

deepseek-r1:7b's classification score (64%) matches GPT-5.2's baseline without any optimization. However, its per-example latency is 13–18 seconds vs qwen3:4b's ~95ms — a critical production tradeoff.

Phase 5: Optimization Tipping Point

Question: at what number of labeled examples does optimization start working? Setup: BootstrapFewShotWithRandomSearch on qwen3:4b, with GPT-4o as the teacher.

| Task | Baseline | n=0 | n=1 | Result |
|---|---|---|---|---|
| math | 50% | 50% | 64% (+14pp) | Tipping point at n=1 |
| classification | 34% | 34% | 34% | Resists optimization entirely |
| qa | 12% | 12% | 10% | Optimization degraded (program failed) |

These three tasks represent the three possible outcomes of optimization: responsive (math jumps +14pp from a single example), resistant (classification: 3.7 hours, zero gain), and broken (QA program failed to compile at n=0 and n=1).

Methodology & Limitations

Optimization Protocol

Statistical Limitations

- Phase 1 noise floor (n=50): Standard error ≈ 7pp. Deltas smaller than ±4pp are statistically indistinguishable from zero. Only large deltas (≥10pp) should be treated as directionally reliable. The unified experiment (n=200) cuts the noise floor to ~3.5pp.
- DSPy version confound: Phase 1 results span DSPy 2.6.x (llama3.2, mistral) and DSPy 3.1.3 (phi4, qwen2.5, gpt-5.2). Baseline discrepancies of 2–4pp exist across versions and cannot be fully attributed to model differences alone.
- Synthesis metric discontinuity: Phase 1 synthesis scores (40–49%) and unified experiment scores (81.5–98%) use different evaluation metrics. Direct comparison across phases is not valid for this task.

Two Confirmed Significant Findings (Unified Experiment, n=200)

★ Mistral + MIPROv2, Agentic: +13.5pp (p<0.0001) — 53.0% → 66.5%. Largest confirmed gain in the study. CI: [5.3, 21.7].
★ GPT-5.2 + MIPROv2, Synthesis: +12.0pp (p<0.0001) — 81.5% → 93.5%. CI: [4.4, 19.6]. Note: uses different metric than Phase 1.

Technical Stack

| Component | Technology | Role |
|---|---|---|
| Optimization | DSPy (MIPROv2) | Prompt instruction + demo optimization |
| Local inference | Ollama | Llama 3.2, Qwen 2.5, Mistral, Phi-4, DeepSeek-R1, Qwen3 |
| API baselines | OpenAI API | GPT-4o, GPT-5.2 |
| Teacher model | GPT-4o | Generates optimized instructions and demonstrations |
| Datasets | HuggingFace Datasets | Banking77, GSM8K, HotPotQA, CoNLL-2003, BoolQ, XSum, StrategyQA, MBPP, HumanEval |
| Data analysis | Pandas | Result aggregation and analysis |
| Dashboard | Chart.js + React | Interactive visualization |

Optimization Loop

  1. Teacher Bootstrapping: GPT-4o generates high-quality few-shot examples for the target task
  2. MIPROv2: Optimizes task instructions and selects demonstrations for the student model
  3. Variance Reduction: 3 seeds (10, 20, 30); best-of-3 selected by dev set accuracy
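
A minimal sketch of this loop with DSPy's public API. The model ids, the `trainset`/`devset` variables, and the exact-match metric are placeholders (the study's real tasks use per-domain metrics), and details of `Evaluate`'s return type vary slightly across DSPy versions:

```python
import dspy
from dspy.teleprompt import MIPROv2

teacher = dspy.LM("openai/gpt-4o")  # bootstraps demos, proposes instructions
student = dspy.LM("ollama_chat/qwen2.5:7b", api_base="http://localhost:11434")
dspy.configure(lm=student)

program = dspy.ChainOfThought("question -> answer")
metric = lambda example, pred, trace=None: example.answer == pred.answer

best_score, best_program = -1.0, None
for seed in (10, 20, 30):  # variance reduction: best-of-3 seeds
    optimizer = MIPROv2(metric=metric, prompt_model=teacher,
                        task_model=student, auto="light", seed=seed)
    optimized = optimizer.compile(program, trainset=trainset)
    result = dspy.Evaluate(devset=devset, metric=metric)(optimized)
    score = getattr(result, "score", result)  # float or EvaluationResult
    if score > best_score:
        best_score, best_program = score, optimized
```

Running this requires a local Ollama server and an OpenAI API key; it is a shape-of-the-pipeline sketch, not the study's exact configuration.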

Interactive Dashboards

The experiment results are visualized across three interactive pages: