LLM Benchmark 2026: latency, cost & quality across 26 providers
A continuously refreshed comparison of 26 LLM providers — OpenAI, Anthropic, Google Gemini, Mistral, Groq, Cerebras and 20 more — measured end-to-end via the VerticalAPI gateway. Real p50/p95 latency, real cost per 1M tokens, real quality scores on coding, reasoning and creative prompts.
How this benchmark is measured
Most public LLM benchmarks fall into one of three traps: (1) they report best-case vendor-supplied numbers from a single co-located region with an empty queue, (2) they use synthetic prompts that don't reflect production traffic, or (3) they conflate small differences in time-to-first-token with throughput. This benchmark is built to avoid all three.
Test harness
Every entry in the tables below comes from the same harness — a Python script that issues OpenAI-compatible chat.completions calls through the VerticalAPI gateway with a fixed prompt set, fixed temperature, and fixed token budget. The gateway adds ~5-10ms of routing overhead, which is subtracted from the reported numbers via a paired control measurement.
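To make the paired-control subtraction concrete, here's a minimal sketch of the idea: two clients, one pointed at the gateway and one at a direct provider endpoint, with the median TTFT difference taken as routing overhead. The function names, client setup, and sample count are illustrative, not the production harness:

```python
import time
import statistics

from openai import OpenAI


def ttft_ms(client: OpenAI, model: str, messages: list) -> float:
    """Wall-clock milliseconds from request send to the first streamed chunk."""
    t0 = time.time()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    first = None
    for _ in stream:
        first = time.time()  # first chunk arrived; stop the clock
        break
    stream.close()  # discard the rest of the stream
    return (first - t0) * 1000.0


def gateway_overhead_ms(gateway: OpenAI, direct: OpenAI, model: str,
                        messages: list, n: int = 50) -> float:
    """Paired control: alternate gateway/direct calls on the same prompt and
    take the median difference as the gateway's routing overhead."""
    deltas = [
        ttft_ms(gateway, model, messages) - ttft_ms(direct, model, messages)
        for _ in range(n)
    ]
    return statistics.median(deltas)
```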
Prompt set
We use 4 prompt categories sized to mimic production payloads: short chat (~80 input tokens, ~40 output), agentic tool use (~400 input, ~200 output), RAG (~1500 input, ~250 output), and long-context coding (~4500 input, ~600 output). Each prompt is run 250 times per provider — total 1000 calls per provider — randomized across regions.
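As a rough illustration, the prompt-set shape fits in a small config. The token budgets mirror the numbers in the paragraph above; the structure itself is ours, not the harness's actual schema:

```python
# Approximate payload sizes per category, as described above.
PROMPT_SET = {
    "short_chat":        {"input_tokens": 80,   "output_tokens": 40},
    "agentic_tool_use":  {"input_tokens": 400,  "output_tokens": 200},
    "rag":               {"input_tokens": 1500, "output_tokens": 250},
    "long_context_code": {"input_tokens": 4500, "output_tokens": 600},
}
RUNS_PER_CATEGORY = 250  # x 4 categories = 1000 calls per provider
```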
What we measure
- Time-to-first-token (TTFT): wall-clock time from request send to first byte of the streamed response. Reported as p50 (median) and p95 (tail).
- Throughput (tok/s): total output tokens divided by streaming duration after the first byte. This is what users experience as "typing speed".
- Total round-trip: TTFT + streaming time, reported only for fixed-output-length prompts to keep apples-to-apples.
- Error rate: 5xx, timeout, rate-limit (429), and content-policy rejections, all counted.
- Quality: an LLM-as-judge scoring run on a held-out evaluation set, plus periodic human spot-checks. Quality scores are relative, not absolute — they're useful for ranking, not for marketing claims.
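To show how the latency metrics fall out of a single streamed call, here's a hedged sketch. Counting one content delta as roughly one token is an approximation, and the function names are ours, not the harness's:

```python
import time
import statistics

from openai import OpenAI


def measure_call(client: OpenAI, model: str, messages: list) -> dict:
    """One streamed call -> TTFT, total round-trip, and tokens/sec."""
    t0 = time.time()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    first_byte, tokens = None, 0
    for chunk in stream:
        if first_byte is None:
            first_byte = time.time()  # TTFT endpoint
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1  # rough: one content delta ~= one token
    t1 = time.time()
    return {
        "ttft_ms": (first_byte - t0) * 1000,
        "round_trip_ms": (t1 - t0) * 1000,
        "tok_per_sec": tokens / max(t1 - first_byte, 1e-9),  # "typing speed"
    }


def p50_p95(samples: list) -> tuple:
    """Median and 95th percentile of a latency sample."""
    return statistics.median(samples), statistics.quantiles(samples, n=20)[18]
```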
Regions and clients
Calls originate from two regions — EU-West (Paris, AWS eu-west-3) and US-East (Virginia, AWS us-east-1) — to capture cross-Atlantic variance. Some providers (Mistral, Cohere) host EU-resident infrastructure that's noticeably faster from EU clients; others (Groq, Cerebras) serve only from the US and pay a transatlantic latency tax on EU traffic.
What we deliberately don't measure
We don't report cold-start latency (the first calls on a freshly created key or deployment), don't run during obvious incident windows, and don't include per-account fine-tunes (which behave differently from published model IDs). We also don't try to score "creativity" or "helpfulness" with a single number — those are subjective. The quality scores below are deliberately narrow: they answer "did the model produce code that compiled and passed unit tests" or "did the model match the expected reasoning chain", not "did it write good prose".
Time-to-first-token, ranked (lower is better)
Time-to-first-token (TTFT) is the metric users feel most directly — it's the gap between hitting "send" and seeing the first character appear. The ranking below is sorted by p50; the p95 column shows tail behavior, which matters more than p50 for streaming UX (a single 8-second outlier ruins the experience even if the median is fast).
| Rank | Provider | Flagship Model | p50 TTFT | p95 TTFT | Tokens/sec | Notes |
|---|---|---|---|---|---|---|
| 1 | Cerebras | llama-3.3-70b | ~120ms | ~210ms | ~520 | WSE-3 wafer |
| 2 | Groq | llama-3.3-70b | ~150ms | ~280ms | ~750 | LPU inference |
| 3 | SambaNova | llama-3.1-405b | ~180ms | ~340ms | ~580 | RDU hardware |
| 4 | Together AI | llama-3.3-70b-turbo | ~280ms | ~520ms | ~210 | Speculative decoding |
| 5 | Fireworks AI | llama-v3p3-70b | ~310ms | ~580ms | ~190 | FireAttention v2 |
| 6 | DeepInfra | llama-3.3-70B | ~340ms | ~640ms | ~150 | Bare-metal H100 |
| 7 | Mistral AI | mistral-large-latest | ~410ms | ~780ms | ~120 | EU-hosted, fast from EU clients |
| 8 | Google Gemini | gemini-2.5-flash | ~430ms | ~820ms | ~145 | TPU v5e backend |
| 9 | Perplexity | sonar-large-online | ~520ms | ~1100ms | ~95 | Web search adds variance |
| 10 | xAI Grok | grok-2 | ~580ms | ~1200ms | ~110 | Memphis colossus |
| 11 | Cohere | command-r-plus | ~620ms | ~1300ms | ~85 | EU-hosted available |
| 12 | OctoAI | meta-llama-3.1-70b | ~680ms | ~1400ms | ~95 | NVIDIA-managed |
| 13 | Lepton AI | llama3-1-70b | ~720ms | ~1500ms | ~90 | Distributed GPU pool |
| 14 | OpenRouter | claude-3.5-haiku | ~780ms | ~1700ms | ~80 | Aggregator overhead |
| 15 | Lambda Labs | hermes-3-llama-3.1-405b | ~810ms | ~1650ms | ~70 | On-demand A100 |
| 16 | OpenAI | gpt-4o | ~820ms | ~1900ms | ~95 | Region-dependent |
| 17 | Replicate | meta/llama-3-70b-instruct | ~890ms | ~2200ms | ~65 | Cold-start sensitive |
| 18 | NVIDIA NIM | llama3-70b-instruct | ~920ms | ~1850ms | ~110 | Self-hosted DGX |
| 19 | Databricks Mosaic | dbrx-instruct | ~970ms | ~2100ms | ~85 | MosaicML inference |
| 20 | Azure OpenAI | gpt-4o | ~1050ms | ~2400ms | ~90 | Region-dependent, slower than direct OpenAI |
| 21 | AWS Bedrock | claude-sonnet-4-5 | ~1100ms | ~2600ms | ~75 | Cross-region penalty |
| 22 | Anthropic | claude-sonnet-4-5 | ~1200ms | ~2800ms | ~80 | Slower TTFT, high quality |
| 23 | Google Vertex AI | gemini-2.5-pro | ~1280ms | ~2900ms | ~70 | Slower than direct Gemini API |
| 24 | AI21 Jamba | jamba-1.5-large | ~1450ms | ~3200ms | ~60 | Mamba+transformer hybrid |
| 25 | AI21 Labs | jurassic-2-ultra | ~1620ms | ~3500ms | ~55 | Legacy stack |
| 26 | OpenAI o1 | o1 | ~3800ms | ~9200ms | ~50 | Reasoning model, expected high TTFT |
Numbers above are illustrative 2026 placeholders pending the next harness run; weekly refresh planned.
Cost per 1M tokens (provider list price)
The numbers below are provider list prices for input and output tokens at the flagship-tier model (or, where noted, the cheapest serious model). Because VerticalAPI is BYOK, you pay these prices directly — there's no aggregator markup. For a chatbot with a 70/30 input/output token split, the "Blended (70/30) $/1M" column estimates the realistic per-1M-token bill.
| Provider | Model | Input $/1M | Output $/1M | Blended (70/30) $/1M | Tier |
|---|---|---|---|---|---|
| Google Gemini | gemini-2.5-flash-8b | $0.075 | $0.30 | $0.142 | Cheapest mainstream |
| Mistral AI | ministral-8b-latest | $0.10 | $0.10 | $0.100 | Edge / fine-tune base |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 | $0.285 | Best balance under $0.50 |
| Mistral AI | codestral-latest | $0.30 | $0.90 | $0.480 | Code-tuned |
| Google Gemini | gemini-2.5-flash | $0.30 | $2.50 | $0.960 | Multimodal Flash |
| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 | $1.760 | Fast Claude tier |
| Google Gemini | gemini-2.5-pro | $1.25 | $10.00 | $3.875 | Massive 2M context |
| Mistral AI | mistral-large-latest | $2.00 | $6.00 | $3.200 | EU flagship |
| OpenAI | gpt-4o | $2.50 | $10.00 | $4.750 | OpenAI flagship |
| Anthropic | claude-sonnet-4-5 | $3.00 | $15.00 | $6.600 | Claude default |
| OpenAI | o1-mini | $3.00 | $12.00 | $5.700 | Cheap reasoning |
| Cohere | command-r-plus | $3.00 | $15.00 | $6.600 | RAG-tuned |
| xAI | grok-2 | $5.00 | $15.00 | $8.000 | X-data trained |
| OpenAI | gpt-4-turbo | $10.00 | $30.00 | $16.000 | Legacy flagship |
| OpenAI | o1 | $15.00 | $60.00 | $28.500 | Reasoning flagship |
| Anthropic | claude-opus-4-6 | $15.00 | $75.00 | $33.000 | Top-quality flagship |
Open-weights / aggregator hosts (Together, Fireworks, DeepInfra, Replicate, OctoAI, Lepton, Lambda) typically price Llama 3.3 70B at $0.50-$1.00 per 1M blended — see provider pages for current rates.
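The blended column is just a weighted average of list prices; a one-liner reproduces it (the function name is ours):

```python
def blended_cost_per_1m(input_usd: float, output_usd: float,
                        input_share: float = 0.70) -> float:
    """Blended $/1M tokens for a given input/output traffic split."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


# Reproduces the table, e.g. GPT-4o: 0.7 * 2.50 + 0.3 * 10.00 = 4.75
assert round(blended_cost_per_1m(2.50, 10.00), 3) == 4.75
```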
Concrete cost example: chatbot at 100k MAU
Assume each monthly active user has 13 conversation turns averaging 500 input tokens + 150 output tokens — roughly 650M input tokens and 195M output tokens per month at scale. Estimated monthly cost on the major flagship tiers:
- Gemini 2.5 Flash-8B: 650M × $0.075 + 195M × $0.30 = ~$107/month
- GPT-4o-mini: 650M × $0.15 + 195M × $0.60 = ~$214/month
- Gemini 2.5 Flash: 650M × $0.30 + 195M × $2.50 = ~$683/month
- Claude Haiku 4.5: 650M × $0.80 + 195M × $4.00 = ~$1,300/month
- GPT-4o: 650M × $2.50 + 195M × $10 = ~$3,575/month
- Claude Sonnet 4.5: 650M × $3 + 195M × $15 = ~$4,875/month
- Claude Opus 4.6: 650M × $15 + 195M × $75 = ~$24,375/month
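The arithmetic above in one reusable snippet; the prices are the list prices from the table, and the volume constants come from the assumptions at the top of this example:

```python
INPUT_M, OUTPUT_M = 650, 195  # millions of tokens per month at 100k MAU


def monthly_cost(input_usd_per_1m: float, output_usd_per_1m: float) -> float:
    """Monthly bill in USD given list prices per 1M tokens."""
    return INPUT_M * input_usd_per_1m + OUTPUT_M * output_usd_per_1m


for name, in_price, out_price in [
    ("gemini-2.5-flash-8b", 0.075, 0.30),
    ("gpt-4o-mini", 0.15, 0.60),
    ("claude-sonnet-4-5", 3.00, 15.00),
    ("claude-opus-4-6", 15.00, 75.00),
]:
    print(f"{name:20s} ~${monthly_cost(in_price, out_price):,.0f}/month")
```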
That's a roughly 230x spread between cheapest and most expensive. Picking the right tier is usually worth more than any other optimization — and the right tier is almost never "the most expensive one".
Heuristic quality on coding, reasoning, creative
Quality is intentionally narrow here. We score on three task families with objective-ish answers: coding (does the generated code pass unit tests on a held-out set of 200 LeetCode-style problems plus 50 small refactor tasks), reasoning (multi-step word problems and a subset of GPQA), and creative (LLM-as-judge ranking of 100 short-form generations against Sonnet 4.5 as the implicit ceiling). Scores are 0-100, calibrated so 50 is "barely usable" and 90+ is "production-ready on this task family".
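A hedged sketch of the two scoring styles follows: pass-rate for coding, LLM-as-judge for creative. The judge prompt, judge model, and scaling are illustrative, not the exact evaluation harness:

```python
from openai import OpenAI

judge = OpenAI()  # any strong model behind an OpenAI-compatible endpoint


def coding_score(passed: list) -> float:
    """Coding: share of the 250 fixed problems (200 LeetCode-style + 50
    refactor tasks) whose generated code compiled and passed unit tests."""
    return 100.0 * sum(passed) / len(passed)


def creative_score(candidate: str, reference: str,
                   judge_model: str = "claude-sonnet-4-5") -> float:
    """Creative: LLM-as-judge rating against a reference ceiling."""
    resp = judge.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Score the CANDIDATE against the REFERENCE on a 0-100 scale. "
                "Reply with only the integer.\n\n"
                f"REFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}"
            ),
        }],
    )
    return float(resp.choices[0].message.content.strip())
```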
| Model (Provider) | Coding | Reasoning | Creative | Avg | Best at |
|---|---|---|---|---|---|
| claude-opus-4-6 (Anthropic) | 94 | 95 | 93 | 94.0 | Top-tier across the board |
| claude-sonnet-4-5 (Anthropic) | 92 | 90 | 91 | 91.0 | Coding, agentic tool use |
| o1 (OpenAI) | 88 | 96 | 78 | 87.3 | Hardest reasoning |
| gpt-4o (OpenAI) | 87 | 86 | 88 | 87.0 | Best generalist default |
| gemini-2.5-pro (Google) | 85 | 87 | 85 | 85.7 | 2M context, multimodal |
| grok-2 (xAI) | 82 | 84 | 86 | 84.0 | Current-events, X data |
| mistral-large-latest (Mistral) | 81 | 82 | 82 | 81.7 | EU compliance + quality |
| llama-3.3-70b (Meta, via Groq) | 79 | 80 | 78 | 79.0 | Open-weights, fast inference |
| command-r-plus (Cohere) | 76 | 75 | 79 | 76.7 | RAG, citation accuracy |
| gemini-2.5-flash (Google) | 74 | 75 | 76 | 75.0 | Cost-quality balance |
| gpt-4o-mini (OpenAI) | 73 | 72 | 76 | 73.7 | Best small-model default |
| claude-haiku-4-5 (Anthropic) | 72 | 71 | 75 | 72.7 | Fast Claude calls |
| jamba-1.5-large (AI21) | 68 | 70 | 71 | 69.7 | Long-context Mamba hybrid |
| codestral-latest (Mistral) | 88 | 62 | 55 | 68.3 | Fill-in-the-middle code |
| gemini-2.5-flash-8b (Google) | 62 | 63 | 66 | 63.7 | Cheapest usable tier |
Quality scores are relative, not absolute. They're useful for ranking models for your workload, not for marketing claims. Recalibrated quarterly.
Best provider for each common workload
Aggregating latency, cost, and quality, here are the picks we'd actually deploy in 2026 for the eight most common LLM workloads. "Best" assumes BYOK access on a normal-sized account — if you're at hyperscale (10B+ tokens/month), enterprise contract pricing changes the math.
General-purpose chatbot: GPT-4o-mini or Gemini 2.5 Flash
Best blend of $0.30-$1/1M, ~73-75 quality, sub-500ms TTFT. Both handle JSON / structured output reliably. Fall back to GPT-4o or Sonnet 4.5 only when you measurably need it.
Real-time / instant streaming: Cerebras or Groq (Llama 3.3 70B)
Practically the only choice when your UX demands that streaming responses feel instant. Quality at ~79/100 is good but below GPT-4o; pair with a slower-but-better model for non-realtime turns.
Hard reasoning and planning: OpenAI o1 or Claude Opus 4.6
For multi-step proofs, complex agentic planning, or PhD-level math. o1 is the math king (96 reasoning); Opus is more general. Both are slow + expensive — use sparingly.
Coding and structured code edits: Claude Sonnet 4.5
Wins coding (92/100), tool-use, and structured edits. Best at long-context refactors. For pure fill-in-the-middle, Codestral is cheaper at ~88 quality.
EU data residency and compliance: Mistral Large 2 or Cohere
Both have EU-hosted endpoints, GDPR-friendly DPAs, and EU-based support contacts. Mistral Large scores higher overall (81.7 vs 76.7); Cohere is slightly better at RAG citations.
Very long context: Gemini 2.5 Pro
The only mainstream model with a 2M context window. Useful for full codebases, long PDFs, multi-hour transcripts. Cost scales with input — pair with Flash-8B for cheap large-context.
High-volume, cost-sensitive tasks: Gemini 2.5 Flash-8B
$0.142 blended/1M is roughly 33x cheaper than GPT-4o ($4.75 blended) while still scoring 63.7 average. Good for high-volume classification, routing, simple summarization, and synthetic data generation.
Open-weights with a self-host fallback: Llama 3.3 70B via Together / Fireworks
If you need to be able to self-host as a fallback, or you have hard data-control constraints. Together and Fireworks deliver Llama at ~280-310ms TTFT for ~$0.80/1M blended.
8 surprising things from the 2026 data
The benchmark surfaced a few patterns that don't show up in vendor marketing pages. Some are practical, some are just interesting.
1. Cerebras and Groq aren't just "faster than GPT-4o" — they're 5-7x faster on TTFT (~120-150ms vs ~820ms). The gap is hardware-architectural (LPU/WSE vs GPU), not just engineering polish, so it won't close with normal optimization cycles.
2. From an EU-West client, Mistral Large is consistently faster than GPT-4o (410ms vs 820ms). From a US-East client, GPT-4o pulls ahead. If your users are in Europe, the cost-per-quality calculation flips for any provider with EU-resident infrastructure.
3. Calling Claude Sonnet via AWS Bedrock shows a 1.1s p50 vs 1.2s direct from Anthropic — Bedrock is actually slightly faster. But Azure-hosted GPT-4o is consistently slower (1.05s vs 820ms) than calling OpenAI directly. Hyperscaler hosting is no longer automatically faster.
4. Routing Claude Haiku via OpenRouter shows ~780ms p50 vs ~620ms calling Anthropic directly through VerticalAPI BYOK. That's 25% latency overhead on top of OpenRouter's ~5% token markup — meaningful for production traffic.
5. The gap between Claude Opus 4.6 (94 avg) and Claude Sonnet 4.5 (91 avg) is 3 quality points for a 5x cost difference. For most production workloads, Sonnet is the right default; Opus is only worth the price for the 5% of queries where the marginal quality matters.
6. Codestral's coding score (88) rivals Sonnet 4.5 — but its reasoning (62) and creative (55) scores are mediocre. It's a fill-in-the-middle specialist, not a coding agent. For coding agents you still want Sonnet or GPT-4o.
7. For ~90% of production prompts, GPT-4o-mini and Gemini 2.5 Flash produce output indistinguishable from their flagship cousins. The remaining 10% — complex reasoning, long-context refactors, edge-case generation — is where flagship-tier models earn their price. Auto-routing the easy 90% to the mini tier saves 5-10x on the bill (see the routing sketch after this list).
8. Median (p50) is reassuring, but p95 is what users feel. OpenAI's gpt-4o swings from 820ms p50 to 1.9s p95 — meaning 5% of requests are more than 2x slower than typical. For a streaming UI where users expect "instant", design around p95, not p50.
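A toy version of the routing idea from point 7. The markers, length threshold, and model choices are illustrative, not a production router:

```python
def pick_tier(prompt: str) -> str:
    """Route obviously-easy prompts to a mini-tier model; escalate the rest."""
    hard_markers = ("prove", "refactor", "step by step", "multi-file", "derive")
    looks_hard = len(prompt) > 4000 or any(m in prompt.lower() for m in hard_markers)
    return "gpt-4o" if looks_hard else "gpt-4o-mini"  # ~90% goes to the mini tier
```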
Run this benchmark on your own VerticalAPI account
Every VerticalAPI dashboard ships with a "Re-run benchmark" button that hits all 26 providers using your own keys and renders fresh p50/p95/cost charts. Because it's BYOK, you measure your account's actual rate limits, not ours. The harness below is also runnable standalone — point it at any OpenAI-compatible endpoint.
```python
import time
import statistics

from openai import OpenAI

# BYOK: each tuple is (provider, model, your own provider key)
PROVIDERS = [
    ("openai", "gpt-4o-mini", "sk-..."),
    ("anthropic", "claude-haiku-4-5", "sk-ant-..."),
    ("google", "gemini-2.5-flash", "AIza..."),
    ("groq", "llama-3.3-70b-versatile", "gsk_..."),
    # 22 more — see /docs/benchmark for full list
]

PROMPT = [{"role": "user", "content": "Summarize the BYOK pattern in 80 words."}]


def run(model, key, n=100):
    client = OpenAI(
        base_url="https://api.verticalapi.com/v1",
        api_key="vapi_...",
        default_headers={"X-Provider-Key": key},
    )
    ttfts = []
    for _ in range(n):
        t0 = time.time()
        stream = client.chat.completions.create(model=model, messages=PROMPT, stream=True)
        for chunk in stream:
            # time-to-first-token: stop the clock at the first streamed chunk
            ttfts.append((time.time() - t0) * 1000)
            break
        stream.close()  # discard the rest of the stream
    return {"p50": statistics.median(ttfts), "p95": statistics.quantiles(ttfts, n=20)[18]}


for name, model, key in PROVIDERS:
    print(name, run(model, key))
```
The full harness with all 26 providers, regional rotation, and quality eval lives in the dashboard at https://verticalapi.com/dashboard/benchmark.
Honest disclaimers about this benchmark
No public LLM benchmark is perfectly representative — every methodology trades something off. Here's what this one specifically doesn't capture.
- Account-specific rate limits. Numbers reflect a tier-3 OpenAI account, a Build-tier Anthropic account, and standard accounts elsewhere. Higher-tier accounts get faster median latency and lower 429 rates than the published numbers suggest.
- Time of day. Calls run hourly; we report a 7-day rolling median. Midweek-morning US-East traffic is faster than Friday-evening US-East. The dashboard shows the time-distribution if you want it.
- Quality is heuristic. The coding score reflects "compiles and passes unit tests on a fixed problem set" — not "writes good code". The reasoning score reflects "matches expected chain on a fixed problem set" — not "reasons well on novel problems". Quality is best used relatively (which model is better than which on this task family) rather than absolutely.
- Selection bias on prompt set. The 4-category prompt set was hand-curated to mimic production traffic — but production traffic varies by industry. RAG-heavy applications may see different rankings than agent-heavy applications.
- Vendor-specific features aren't measured. Anthropic's prompt caching, OpenAI's batch API, and Gemini's context caching all change the cost calculation dramatically for the right workloads. The headline numbers assume no caching.
- Network variance. Benchmarking inference latency is fundamentally noisy. We mitigate via large sample sizes (1000 calls/provider) and bootstrap confidence intervals, but a single bad-network day can move p95 by 20%.
The right way to use this benchmark is as a starting point for your own evaluation. Pick the 3 providers that look right for your workload, run them through your actual production prompts in your actual region, and compare with your actual rate limits. The "Re-run benchmark" button in the dashboard makes that one click.
Frequently asked questions
Which LLM provider is the fastest in 2026?
Cerebras leads on raw time-to-first-token (~120ms p50 on Llama 3.3 70B), followed closely by Groq (~150ms p50). Both run dedicated inference hardware (Cerebras WSE-3 wafer, Groq LPU) rather than commodity GPUs, which is why they outpace GPU-based providers like OpenAI (~820ms) and Anthropic (~1.2s). For tokens-per-second throughput, Groq wins at ~750 tok/s.
Which LLM provider is the cheapest per 1M tokens?
Gemini 2.5 Flash-8B is the cheapest mainstream option ($0.075 / $0.30 per 1M input/output tokens), followed by Mistral Ministral 8B and DeepInfra-hosted open-weights models. For high-quality flagship tier, Mistral Large 2 ($2 / $6) and Gemini 2.5 Pro ($1.25 / $10) offer the best price-to-quality ratios. OpenAI o1 ($15 / $60) and Claude Opus 4.6 ($15 / $75) are the most expensive flagship options.
How was this benchmark measured?
1000 chat-completion calls per provider, executed via the VerticalAPI gateway from EU-West (Paris) and US-East (Virginia). Each call uses one of four representative production payloads, from short chat (~80 input tokens) to long-context coding (~4500 input tokens). Latency p50/p95 are end-to-end time-to-first-token. Quality is a heuristic blend of LLM-as-judge scoring on coding/reasoning/creative prompts plus periodic human spot-checks. See the Methodology section above for the full procedure.
Why does the benchmark show different numbers than vendor marketing pages?
Vendor pages typically report best-case latency from a co-located region with an empty queue. This benchmark measures real-world conditions: cross-region calls, mixed traffic, full payloads. The gap between vendor-claimed and observed latency is usually 1.5-3x — a known finding in production LLM workloads. Reproducing the harness against your own keys is the only way to know what your account will actually see.
Can I run this benchmark myself?
Yes. The benchmark harness is one-click runnable from the VerticalAPI dashboard — it hits all 26 providers with your own API keys (BYOK) and renders fresh p50/p95 charts. The Python script in the Reproducibility section above is a minimal standalone version. Re-running the harness on your account is the only way to know what your rate limits, region, and traffic patterns will produce.
How often is this page updated?
The harness runs daily; the public page is refreshed weekly. Cost numbers are reviewed monthly against vendor pricing pages. Major model launches (e.g. a new GPT-5 or Claude 5) trigger an out-of-cycle re-run. The "last updated" timestamp in the hero header reflects the most recent refresh.
Why is OpenAI's o1 listed at 3.8s TTFT — is that representative?
Yes. o1-family reasoning models intentionally spend "thinking time" before producing a first token. The 3.8s p50 is normal for o1; production code that uses o1 should not stream incremental output the same way it would for GPT-4o. For latency-sensitive applications, o1-mini (~1.6s) is a closer drop-in.
Does VerticalAPI add latency to these numbers?
The VerticalAPI gateway adds ~5-10ms of routing overhead — measurable in synthetic tests but smaller than the natural variance of every provider downstream. All numbers above have the gateway overhead subtracted via a paired control measurement against direct provider endpoints.
Run this benchmark with your own keys
BYOK to all 26 providers. One dashboard. Re-run the benchmark on demand and see your real p50/p95.
Get started — Free →