Hypothesis-driven model routing research. We ran structured waves of experiments comparing local Ollama models against Claude Sonnet across task types: code generation, bug detection, repair, and structured output. The work produced a validated routing table, a published paper draft, and two public blog posts. Active: Wave 7 in progress.
Problem
Local model routing needed evidence instead of vibes: which models can replace cloud models, and for which tasks?
Experiment
Run thousands of shadow tests across local models with independent judge evaluation and task-specific hypotheses.
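To make the shadow-test idea concrete, here is a minimal sketch, not the lab's actual harness. It assumes each model and the judge can be wrapped as plain Python callables; the names `ShadowResult`, `run_shadow_tests`, and `pass_rates` are hypothetical and introduced only for illustration. Every prompt goes to every model, a separate judge scores each response, and pass rates are grouped by task type so each hypothesis can be checked per task.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical interfaces: a model maps a prompt to a response,
# and a judge maps (prompt, response) to a pass/fail verdict.
Model = Callable[[str], str]
Judge = Callable[[str, str], bool]


@dataclass
class ShadowResult:
    task_type: str
    model: str
    passed: bool


def run_shadow_tests(
    prompts: List[Tuple[str, str]],   # (task_type, prompt) pairs
    models: Dict[str, Model],
    judge: Judge,
) -> List[ShadowResult]:
    """Send every prompt to every model and let the judge score each response."""
    results = []
    for task_type, prompt in prompts:
        for name, model in models.items():
            response = model(prompt)
            results.append(ShadowResult(task_type, name, judge(prompt, response)))
    return results


def pass_rates(results: List[ShadowResult]) -> Dict[Tuple[str, str], float]:
    """Aggregate pass rates per (task_type, model) pair."""
    counts: Dict[Tuple[str, str], Tuple[int, int]] = {}
    for r in results:
        key = (r.task_type, r.model)
        passed, total = counts.get(key, (0, 0))
        counts[key] = (passed + int(r.passed), total + 1)
    return {key: passed / total for key, (passed, total) in counts.items()}
```

In a real run the callables would wrap a local model client and a cloud client, and the judge would be a different model from the ones under test; those details are left out here because the source does not specify them.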
Shipped artifact
A validated routing table, published analysis, and an active research program feeding the lab operating model.
Result
Ralph Lab turned model selection into a measurable system rather than a default-provider choice.
What we learned
Local models can win specific jobs when evaluation is task-specific, statistically grounded, and continuously updated.
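One way to read "statistically grounded" is to route on a confidence bound rather than a raw pass rate. The sketch below uses the Wilson score interval, which is an assumption on our part: the source does not say which statistical method the lab uses. The function names, the `tolerance` parameter, and the example numbers are likewise hypothetical.

```python
import math


def wilson_lower_bound(passes: int, trials: int, z: float = 1.96) -> float:
    """Lower end of the Wilson score interval for a pass rate (z=1.96 is about 95%)."""
    if trials == 0:
        return 0.0
    p = passes / trials
    denom = 1 + z * z / trials
    centre = p + z * z / (2 * trials)
    margin = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials))
    return (centre - margin) / denom


def choose_route(passes: int, trials: int, baseline_rate: float,
                 tolerance: float = 0.05) -> str:
    """Route to the local model only if its lower bound is within
    `tolerance` of the cloud baseline's pass rate (illustrative rule)."""
    if wilson_lower_bound(passes, trials) >= baseline_rate - tolerance:
        return "local"
    return "cloud"
```

With this rule, a local model passing 90 of 100 trials against a 0.85 baseline routes to "local", while 9 of 10 routes to "cloud" even though the raw pass rate is identical, because the small sample leaves the lower bound too low. Rerunning the check as new waves add trials is one way to keep the table continuously updated.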
Proof assets