Full research write-up — hypothesis, methodology, results across 5 experimental phases, and key findings.
Intelligence Arbitrage is a systematic experiment testing whether small, locally-running language models (3–14B parameters), when optimized with DSPy's MIPROv2 prompt optimizer, can match the performance of frontier API models (GPT-4o, GPT-5.2).
The core idea: frontier models are priced uniformly across all tasks, but many production tasks don't require frontier intelligence. If optimization can close the gap on specific tasks, teams can route work to cheap local models and pay frontier prices only where they're genuinely necessary.
DSPy's optimization pipeline (teacher bootstrapping + MIPROv2) can close the performance gap between local open-weight models (3B–14B parameters) and frontier API models (GPT-4o, GPT-5.2) on structured NLP tasks.
The "arbitrage" framing: if a $0/eval local model can match a $0.01+/eval API model on a specific task after optimization, you capture the accuracy without paying the inference cost. The larger the performance gap that optimization closes, the larger the arbitrage opportunity.
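The economics can be made concrete with a small routing calculation. This is an illustrative sketch: the function name, prices, and volume below are hypothetical, not measured values from the experiments.

```python
# Illustrative arbitrage math: every number here is a placeholder.
def monthly_savings(evals_per_month: int,
                    api_cost_per_eval: float,
                    local_cost_per_eval: float = 0.0) -> float:
    """Dollars saved per month by routing a task to a local model."""
    return evals_per_month * (api_cost_per_eval - local_cost_per_eval)

# A task running 1M evals/month at $0.01/eval on a frontier API,
# routed to a $0/eval local model that matches it after optimization:
savings = monthly_savings(1_000_000, 0.01)
print(f"${savings:,.0f}/month")  # $10,000/month
```

The per-task gap tables below are what determine whether that routing decision is safe for a given task.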
Per-model averages across the nine task domains:
| Model | Params | Avg Baseline | Avg Optimized | Avg Delta | Tasks Improved |
|---|---|---|---|---|---|
| qwen2.5:7b | 7B | 51.7% | 65.9% | +14.2pp | 8/9 |
| mistral | 7B | 42.1% | 55.7% | +13.6pp | 6/9 |
| llama3.2 | 3B | 49.7% | 60.2% | +10.5pp | 7/9 |
| gpt-5.2 | API | 68.0% | 77.1% | +9.1pp | 5/9 |
| phi4 | 14B | 56.1% | 65.0% | +8.9pp | 5/9 |
| gpt-4o | API | 63.3% | 71.2% | +7.9pp | 5/9 |
Local optimized vs GPT-5.2 baseline:
| Task | GPT-5.2 Baseline | Best Local Optimized | Verdict |
|---|---|---|---|
| extraction | 44% | phi4: 84% (+40pp) | Massive |
| qa | 22% | phi4: 30% (+8pp) | Strong |
| synthesis | 40% | mistral: 47% (+7pp) | Strong |
| rag | 88% | llama3.2: 90% (+2pp) | Matched |
| code | 90% | phi4: 90% (tied) | Matched |
| agentic | 86% | mistral: 84% (−2pp) | Near-match |
| analysis | 88% | qwen2.5: 84% (−4pp) | Close |
| math | 90% | phi4: 84% (−6pp) | Gap remains |
| classification | 64% | mistral: 54% (−10pp) | Gap remains |
Each of the nine task domains maps to a standard benchmark and metric:
| Domain | Dataset | Metric |
|---|---|---|
| Classification | Banking77 | Accuracy |
| Math | GSM8K | Exact Match |
| QA | HotPotQA | Exact Match |
| Extraction | CoNLL-2003 | F1 Score |
| Analysis | BoolQ | Accuracy |
| Synthesis | XSum | F1 Score (w/ copy penalty) |
| Agentic | StrategyQA | Accuracy |
| RAG | Custom (SQuAD) | Faithfulness |
| Code | MBPP → HumanEval | Pass@1 (cross-dataset) |
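The exact-match and token-F1 metrics behind several of these rows can be sketched as follows. This is a minimal reimplementation of the standard metric definitions, not the project's actual metric code:

```python
def exact_match(prediction: str, gold: str) -> bool:
    """Exact match after light normalization (the style used for math/QA)."""
    return prediction.strip().lower() == gold.strip().lower()

def token_f1(prediction: str, gold: str) -> float:
    """Token-overlap F1 (the style used for extraction/synthesis)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Count overlapping tokens, consuming each gold token at most once.
    common = 0
    gold_pool = list(gold_tokens)
    for tok in pred_tokens:
        if tok in gold_pool:
            gold_pool.remove(tok)
            common += 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("42", " 42 "))                         # True
print(round(token_f1("the cat sat", "the cat ran"), 2))  # 0.67
```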
MIPROv2 optimization was applied to all 6 models across the 9 task domains, with each condition evaluated on n=50 examples. The table shows the percentage-point change from optimization (positive = improved). Large positives (math, extraction, QA) represent genuine gains; deltas within ±4pp are inside the noise floor at n=50.
| Task | llama3.2 | mistral | phi4 | qwen2.5:7b | gpt-4o | gpt-5.2 |
|---|---|---|---|---|---|---|
| agentic | +12 | +28 | −8 | +6 | +4 | −4 |
| analysis | +10 | +5 | −4 | +6 | +11 | 0 |
| classification | +3 | +44 | −2 | +20 | −10 | −2 |
| code | 0 | 0 | 0 | 0 | 0 | −2 |
| extraction | +16 | −2 | +32 | +42 | +28 | +50 |
| math | +26 | +30 | +36 | +36 | +28 | +8 |
| qa | +20 | +18 | +22 | +12 | +20 | +20 |
| rag | +8 | −3 | +2 | +6 | −2 | +4 |
| synthesis | −0.3 | +2.4 | +1.4 | +0.1 | −7.6 | +8.3 |
In 14 cases an optimized local model outperformed an optimized API model. The most dramatic examples:
| Task | Local (optimized) | API (optimized) | Total swing from baseline |
|---|---|---|---|
| classification | mistral 54% | gpt-4o 46% | 54pp swing (was 46pp behind) |
| agentic | mistral 84% | gpt-5.2 82% | 32pp swing (was 30pp behind) |
| code | phi4 90% | gpt-4o 76% | 28pp swing (was already 14pp ahead) |
| extraction | phi4 84% | gpt-4o 74% | 14pp swing |
| synthesis | all 4 locals | gpt-4o 35.7% | gpt-4o regressed −7.6pp, all locals beat it |
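The "swing" column is the change in the local-minus-API gap from baseline to optimized. A sketch of that arithmetic, with baselines back-derived from the optimized scores and the Phase 1 deltas (the helper name is ours):

```python
def swing(local_before: float, local_after: float,
          api_before: float, api_after: float) -> float:
    """Change in the (local - API) gap from baseline to optimized, in pp."""
    return (local_after - api_after) - (local_before - api_before)

# Classification: mistral 10% -> 54%, gpt-4o 56% -> 46%
# (baselines = optimized score minus the delta-table entry).
print(swing(10, 54, 56, 46))  # 54

# Agentic: mistral 56% -> 84%, gpt-5.2 86% -> 82%.
print(swing(56, 84, 86, 82))  # 32
```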
Task type predicts whether optimization will help more reliably than model choice does: code always shows a ~0pp delta, math always shows big gains, and these patterns hold regardless of which model you use. Grouping tasks by the cross-model spread of their deltas makes the tiers explicit:
| Tier | Tasks | Cross-model std | Interpretation |
|---|---|---|---|
| Task-driven | code (0.7), qa (3.2), rag (4.3) | Low | All models respond similarly |
| Weakly task-driven | synthesis (4.8), analysis (5.0), math (9.8) | Medium | Some model-dependent variation |
| Model-dependent | classification (18.2), extraction (17.0), agentic (12.8) | High | Response depends heavily on the model |
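The tier values can be reproduced from the Phase 1 delta table as the population standard deviation of each task's six per-model deltas. A sketch using three representative rows:

```python
from statistics import pstdev

# Per-model optimization deltas (pp) from the Phase 1 table, in the
# order llama3.2, mistral, phi4, qwen2.5:7b, gpt-4o, gpt-5.2.
deltas = {
    "code":           [0, 0, 0, 0, 0, -2],
    "qa":             [20, 18, 22, 12, 20, 20],
    "classification": [3, 44, -2, 20, -10, -2],
}

for task, vals in deltas.items():
    print(task, round(pstdev(vals), 1))
# code 0.7, qa 3.2, classification 18.2 -- matching the tier table
```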
MIPROv2 also showed an all-or-nothing demo count pattern: 30 programs used 5 demos, 24 used 0 demos. No intermediate counts were ever chosen — suggesting MIPROv2 uses a binary heuristic rather than fine-grained demo selection.
For 6 (model, task) pairs where Phase 1 showed negative or near-zero deltas, we progressively evaluated performance at n=0, 1, 2, 3, 4, 5 demonstrations — decomposing optimization into instruction effect and demo effect.
| Model | Task | Instruction Effect | Demo Effect | Diagnosis |
|---|---|---|---|---|
| phi4 | analysis | −4pp (hurt) | −2pp (hurt) | Both hurt — double penalty |
| phi4 | rag | −8pp (hurt) | +8pp (helped) | Instructions damaged, demos compensated |
| phi4 | math | +4pp (helped) | +30pp (helped) | Both helped |
| llama3.2 | math | +6pp (helped) | +26pp (helped) | Both helped, monotonic |
Demo count optima are also model-specific: on math, llama3.2 shows monotonic improvement through all 5 demos, while phi4 gets +20pp from demo #1 alone and shows diminishing returns after that.
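The decomposition itself is simple subtraction: the n=0 (instruction-only) score isolates the instruction effect, and the full program isolates the demo effect on top of it. A sketch with the phi4/math effects from the table (the 48% baseline is illustrative; the report states only the two effects):

```python
def decompose(baseline: float, instr_only: float, full: float):
    """Split a total optimization gain into instruction and demo effects (pp).

    instr_only: score with optimized instructions but n=0 demos.
    full:       score with optimized instructions plus demos.
    """
    instruction_effect = instr_only - baseline
    demo_effect = full - instr_only
    return instruction_effect, demo_effect

# phi4 on math: +4pp from the new instructions alone,
# +30pp more once demonstrations are added.
print(decompose(48, 52, 82))  # (4, 30)
```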
Two reasoning models, qwen3:4b and deepseek-r1:7b, were evaluated as unoptimized baselines across all 9 tasks.
| Strength area | qwen3:4b | deepseek-r1:7b | Winner |
|---|---|---|---|
| Math | 50% | 32% | qwen3:4b (+18pp) |
| RAG | 88% | 74% | qwen3:4b (+14pp) |
| Classification | 14% | 64% | deepseek-r1:7b (+50pp) |
| Extraction | 20% | 66% | deepseek-r1:7b (+46pp) |
| Analysis | 68% | 82% | deepseek-r1:7b (+14pp) |
deepseek-r1:7b's classification score (64%) matches GPT-5.2's baseline without any optimization. However, its per-example latency is 13–18 seconds vs qwen3:4b's ~95ms — a critical production tradeoff.
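That latency gap translates directly into sequential throughput. A quick conversion using the reported per-example latencies (15.5 s is our midpoint of deepseek-r1:7b's 13–18 s range; the helper is illustrative):

```python
def throughput_per_hour(latency_seconds: float) -> int:
    """Sequential examples per hour at a given per-example latency."""
    return int(3600 / latency_seconds)

# deepseek-r1:7b at ~15.5 s/example vs qwen3:4b at ~95 ms/example:
print(throughput_per_hour(15.5))   # 232 examples/hour
print(throughput_per_hour(0.095))  # 37894 examples/hour
```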
This phase tests at what number of labeled examples optimization starts working, using BootstrapFewShotWithRandomSearch on qwen3:4b with GPT-4o as the teacher.
| Task | Baseline | n=0 | n=1 | Result |
|---|---|---|---|---|
| math | 50% | 50% | 64% (+14pp) | Tipping point at n=1 |
| classification | 34% | 34% | 34% | Resists optimization entirely |
| qa | 12% | 12% | 10% | Optimization degraded (program failed) |
These three tasks represent the three possible outcomes of optimization: responsive (math jumps +14pp from a single example), resistant (classification: 3.7 hours, zero gain), and broken (QA program failed to compile at n=0 and n=1).
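These three outcomes suggest a cheap probe before committing to a full optimization run: evaluate at n=0 and n=1, then classify the task. A sketch of that decision rule (the thresholds, labels, and function name are ours, not part of the experiment):

```python
from typing import Optional

def classify_task(baseline: float, score_n0: Optional[float],
                  score_n1: Optional[float], noise_floor: float = 4.0) -> str:
    """Classify a task's response to optimization from an n=0/n=1 probe.

    A None score means the optimized program failed to run at that n.
    """
    if score_n0 is None or score_n1 is None:
        return "broken"
    if max(score_n0, score_n1) - baseline > noise_floor:
        return "responsive"
    if min(score_n0, score_n1) < baseline:
        return "broken"      # optimization made things worse
    return "resistant"

print(classify_task(50, 50, 64))  # responsive  (math)
print(classify_task(34, 34, 34))  # resistant   (classification)
print(classify_task(12, 12, 10))  # broken      (qa)
```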
The technical stack:
| Component | Technology | Role |
|---|---|---|
| Optimization | DSPy (MIPROv2) | Prompt instruction + demo optimization |
| Local inference | Ollama | Llama 3.2, Qwen 2.5, Mistral, Phi-4, DeepSeek-R1, Qwen3 |
| API baselines | OpenAI API | GPT-4o, GPT-5.2 |
| Teacher model | GPT-4o | Generates optimized instructions and demonstrations |
| Datasets | HuggingFace Datasets | Banking77, GSM8K, HotPotQA, CoNLL-2003, BoolQ, XSum, StrategyQA, MBPP, HumanEval |
| Data analysis | Pandas | Result aggregation and analysis |
| Dashboard | Chart.js + React | Interactive visualization |
The experiment results are visualized across three interactive pages: