Give an LLM an API and It'll Thrive. Give It a Touchscreen and It Struggles
Model Comparison
Lower is better for all metrics. Whiskers show 95% CI on fail rate.
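For readers who want to see how a 95% confidence interval on a fail rate can be computed, here is a minimal sketch. The article does not say which interval method the chart uses, so the Wilson score interval below is an assumption, and the function name `wilson_interval` and the counts in the usage line are hypothetical, not data from these tests.

```python
import math


def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a fail rate (z = 1.96 gives roughly 95%).

    Assumption: the chart's whiskers may be computed differently; this is
    one common method for a binomial proportion, not the article's own.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= failures <= runs:
        raise ValueError("failures must be between 0 and runs")

    p = failures / runs                      # observed fail rate
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom

    # Clamp to [0, 1] to absorb floating-point noise at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical counts for illustration only.
    low, high = wilson_interval(failures=5, runs=10)
    print(f"fail rate 50%, 95% CI: {low:.1%} to {high:.1%}")
```

The Wilson interval stays inside 0% to 100% and behaves sensibly when a model has zero failures, which is why it is often preferred over the simpler normal approximation for small run counts.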
How is this calculated?
Test Period
All runs were conducted between March 28 and April 2, 2026, using each provider's publicly available API.
Plans Used
Tests were run