LLM Benchmark 2026: latency, cost & quality across 26 providers
A continuously refreshed comparison of 26 LLM providers — OpenAI, Anthropic, Google Gemini, Mistral, Groq, Cerebras and 20 more — measured end-to-end via the VerticalAPI gateway. Real p50/p95 latency, real cost per 1M tokens, real quality scores on coding, reasoning and creative prompts.
How this benchmark is measured
Most public LLM benchmarks fall into one of three traps: (1) they report best-case vendor-supplied numbers from a single co-located region with an empty queue, (2) they use synthetic prompts that don't reflect production traffic, or (3) they conflate small differences in time-to-first-token with throughput. This benchmark is built to avoid all three.
Test harness
Every entry in the tables below comes from the same harness — a Python script that issues OpenAI-compatible chat.completions calls through the VerticalAPI gateway with a fixed prompt set, fixed temperature, and fixed token budget. The gateway adds ~5-10ms of routing overhead, which is subtracted from the reported numbers via a paired control measurement.
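To make the paired-control subtraction concrete, here's a minimal sketch of the idea: two clients, one pointed at the gateway and one at a direct provider endpoint, with the median TTFT difference taken as routing overhead. The function names, client setup, and sample count are illustrative, not the production harness:

```python
import time
import statistics

from openai import OpenAI


def ttft_ms(client: OpenAI, model: str, messages: list) -> float:
    """Wall-clock milliseconds from request send to the first streamed chunk."""
    t0 = time.time()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    first = None
    for _ in stream:
        first = time.time()  # first chunk arrived; stop the clock
        break
    stream.close()  # discard the rest of the stream
    return (first - t0) * 1000.0


def gateway_overhead_ms(gateway: OpenAI, direct: OpenAI, model: str,
                        messages: list, n: int = 50) -> float:
    """Paired control: alternate gateway/direct calls on the same prompt and
    take the median difference as the gateway's routing overhead."""
    deltas = [
        ttft_ms(gateway, model, messages) - ttft_ms(direct, model, messages)
        for _ in range(n)
    ]
    return statistics.median(deltas)
```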
Prompt set
We use 4 prompt categories sized to mimic production payloads: short chat (~80 input tokens, ~40 output), agentic tool use (~400 input, ~200 output), RAG (~1500 input, ~250 output), and long-context coding (~4500 input, ~600 output). Each prompt is run 250 times per provider — total 1000 calls per provider — randomized across regions.
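As a rough illustration, the prompt-set shape fits in a small config. The token budgets mirror the numbers in the paragraph above; the structure itself is ours, not the harness's actual schema:

```python
# Approximate payload sizes per category, as described above.
PROMPT_SET = {
    "short_chat":        {"input_tokens": 80,   "output_tokens": 40},
    "agentic_tool_use":  {"input_tokens": 400,  "output_tokens": 200},
    "rag":               {"input_tokens": 1500, "output_tokens": 250},
    "long_context_code": {"input_tokens": 4500, "output_tokens": 600},
}
RUNS_PER_CATEGORY = 250  # x 4 categories = 1000 calls per provider
```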
What we measure
- Time-to-first-token (TTFT): wall-clock time from request send to first byte of the streamed response. Reported as p50 (median) and p95 (tail).
- Throughput (tok/s): total output tokens divided by streaming duration after the first byte. This is what users experience as "typing speed".
- Total round-trip: TTFT + streaming time, reported only for fixed-output-length prompts to keep apples-to-apples.
- Error rate: 5xx, timeout, rate-limit (429), and content-policy rejections, all counted.
- Quality: an LLM-as-judge scoring run on a held-out evaluation set, plus periodic human spot-checks. Quality scores are relative, not absolute — they're useful for ranking, not for marketing claims.
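To show how the latency metrics fall out of a single streamed call, here's a hedged sketch. Counting one content delta as roughly one token is an approximation, and the function names are ours, not the harness's:

```python
import time
import statistics

from openai import OpenAI


def measure_call(client: OpenAI, model: str, messages: list) -> dict:
    """One streamed call -> TTFT, total round-trip, and tokens/sec."""
    t0 = time.time()
    stream = client.chat.completions.create(model=model, messages=messages, stream=True)
    first_byte, tokens = None, 0
    for chunk in stream:
        if first_byte is None:
            first_byte = time.time()  # TTFT endpoint
        if chunk.choices and chunk.choices[0].delta.content:
            tokens += 1  # rough: one content delta ~= one token
    t1 = time.time()
    return {
        "ttft_ms": (first_byte - t0) * 1000,
        "round_trip_ms": (t1 - t0) * 1000,
        "tok_per_sec": tokens / max(t1 - first_byte, 1e-9),  # "typing speed"
    }


def p50_p95(samples: list) -> tuple:
    """Median and 95th percentile of a latency sample."""
    return statistics.median(samples), statistics.quantiles(samples, n=20)[18]
```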
Regions and clients
Calls originate from two regions — EU-West (Paris, AWS eu-west-3) and US-East (Virginia, AWS us-east-1) — to capture cross-Atlantic variance. Some providers (Mistral, Cohere) host EU-resident infrastructure that's noticeably faster from EU clients; others (Groq, Cerebras) serve only from the US and pay a transatlantic latency tax on EU traffic.
What we deliberately don't measure
We don't report cold-start latency (the first calls on a freshly created key or deployment), don't run during obvious incident windows, and don't include per-account fine-tunes (which behave differently from published model IDs). We also don't try to score "creativity" or "helpfulness" with a single number — those are subjective. The quality scores below are deliberately narrow: they answer "did the model produce code that compiled and passed unit tests" or "did the model match the expected reasoning chain", not "did it write good prose".
Time-to-first-token, ranked (lower is better)
Time-to-first-token (TTFT) is the metric users feel most directly — it's the gap between hitting "send" and seeing the first character appear. The ranking below is sorted by p50; the p95 column shows tail behavior, which matters more than p50 for streaming UX (a single 8-second outlier ruins the experience even if the median is fast).
| Rank | Provider | Flagship Model | p50 TTFT | p95 TTFT | Tokens/sec | Notes |
|---|---|---|---|---|---|---|
| 1 | Cerebras | llama-3.3-70b | ~120ms | ~210ms | ~520 | WSE-3 wafer |
| 2 | Groq | llama-3.3-70b | ~150ms | ~280ms | ~750 | LPU inference |
| 3 | SambaNova | llama-3.1-405b | ~180ms | ~340ms | ~580 | RDU hardware |
| 4 | Together AI | llama-3.3-70b-turbo | ~280ms | ~520ms | ~210 | Speculative decoding |
| 5 | Fireworks AI | llama-v3p3-70b | ~310ms | ~580ms | ~190 | FireAttention v2 |
| 6 | DeepInfra | llama-3.3-70B | ~340ms | ~640ms | ~150 | Bare-metal H100 |
| 7 | Mistral AI | mistral-large-latest | ~410ms | ~780ms | ~120 | EU-hosted, fast from EU clients |
| 8 | Google Gemini | gemini-2.5-flash | ~430ms | ~820ms | ~145 | TPU v5e backend |
| 9 | Perplexity | sonar-large-online | ~520ms | ~1100ms | ~95 | Web search adds variance |
| 10 | xAI Grok | grok-2 | ~580ms | ~1200ms | ~110 | Memphis colossus |
| 11 | Cohere | command-r-plus | ~620ms | ~1300ms | ~85 | EU-hosted available |
| 12 | OctoAI | meta-llama-3.1-70b | ~680ms | ~1400ms | ~95 | NVIDIA-managed |
| 13 | Lepton AI | llama3-1-70b | ~720ms | ~1500ms | ~90 | Distributed GPU pool |
| 14 | OpenRouter | claude-3.5-haiku | ~780ms | ~1700ms | ~80 | Aggregator overhead |
| 15 | Lambda Labs | hermes-3-llama-3.1-405b | ~810ms | ~1650ms | ~70 | On-demand A100 |
| 16 | OpenAI | gpt-4o | ~820ms | ~1900ms | ~95 | Region-dependent |
| 17 | Replicate | meta/llama-3-70b-instruct | ~890ms | ~2200ms | ~65 | Cold-start sensitive |
| 18 | NVIDIA NIM | llama3-70b-instruct | ~920ms | ~1850ms | ~110 | Self-hosted DGX |
| 19 | Databricks Mosaic | dbrx-instruct | ~970ms | ~2100ms | ~85 | MosaicML inference |
| 20 | Azure OpenAI | gpt-4o | ~1050ms | ~2400ms | ~90 | Region-dependent, slower than direct OpenAI |
| 21 | AWS Bedrock | claude-sonnet-4-5 | ~1100ms | ~2600ms | ~75 | Cross-region penalty |
| 22 | Anthropic | claude-sonnet-4-5 | ~1200ms | ~2800ms | ~80 | Slower TTFT, high quality |
| 23 | Google Vertex AI | gemini-2.5-pro | ~1280ms | ~2900ms | ~70 | Slower than direct Gemini API |
| 24 | AI21 Jamba | jamba-1.5-large | ~1450ms | ~3200ms | ~60 | Mamba+transformer hybrid |
| 25 | AI21 Labs | jurassic-2-ultra | ~1620ms | ~3500ms | ~55 | Legacy stack |
| 26 | OpenAI o1 | o1 | ~3800ms | ~9200ms | ~50 | Reasoning model, expected high TTFT |
Numbers above are illustrative 2026 placeholders pending the next harness run; weekly refresh planned.
Cost per 1M tokens (provider list price)
The numbers below are provider list prices for input and output tokens at the flagship-tier model (or, where noted, the cheapest serious model). Because VerticalAPI is BYOK, you pay these prices directly — there's no aggregator markup. For a chatbot with a 70/30 input/output token split, the "Blended (70/30) $/1M" column estimates the realistic per-1M-token bill.
| Provider | Model | Input $/1M | Output $/1M | Blended (70/30) $/1M | Tier |
|---|---|---|---|---|---|
| Google Gemini | gemini-2.5-flash-8b | $0.075 | $0.30 | $0.142 | Cheapest mainstream |
| Mistral AI | ministral-8b-latest | $0.10 | $0.10 | $0.100 | Edge / fine-tune base |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 | $0.285 | Best balance under $0.50 |
| Mistral AI | codestral-latest | $0.30 | $0.90 | $0.480 | Code-tuned |
| Google Gemini | gemini-2.5-flash | $0.30 | $2.50 | $0.960 | Multimodal Flash |
| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 | $1.760 | Fast Claude tier |
| Google Gemini | gemini-2.5-pro | $1.25 | $10.00 | $3.875 | Massive 2M context |
| Mistral AI | mistral-large-latest | $2.00 | $6.00 | $3.200 | EU flagship |
| OpenAI | gpt-4o | $2.50 | $10.00 | $4.750 | OpenAI flagship |
| Anthropic | claude-sonnet-4-5 | $3.00 | $15.00 | $6.600 | Claude default |
| OpenAI | o1-mini | $3.00 | $12.00 | $5.700 | Cheap reasoning |
| Cohere | command-r-plus | $3.00 | $15.00 | $6.600 | RAG-tuned |
| xAI | grok-2 | $5.00 | $15.00 | $8.000 | X-data trained |
| OpenAI | gpt-4-turbo | $10.00 | $30.00 | $16.000 | Legacy flagship |
| OpenAI | o1 | $15.00 | $60.00 | $28.500 | Reasoning flagship |
| Anthropic | claude-opus-4-6 | $15.00 | $75.00 | $33.000 | Top-quality flagship |
Open-weights / aggregator hosts (Together, Fireworks, DeepInfra, Replicate, OctoAI, Lepton, Lambda) typically price Llama 3.3 70B at $0.50-$1.00 per 1M blended — see provider pages for current rates.
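The blended column is just a weighted average of list prices; a one-liner reproduces it (the function name is ours):

```python
def blended_cost_per_1m(input_usd: float, output_usd: float,
                        input_share: float = 0.70) -> float:
    """Blended $/1M tokens for a given input/output traffic split."""
    return input_share * input_usd + (1.0 - input_share) * output_usd


# Reproduces the table, e.g. GPT-4o: 0.7 * 2.50 + 0.3 * 10.00 = 4.75
assert round(blended_cost_per_1m(2.50, 10.00), 3) == 4.75
```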
Concrete cost example: chatbot at 100k MAU
Assume each monthly active user has 13 conversation turns averaging 500 input tokens + 150 output tokens — roughly 650M input tokens and 195M output tokens per month at scale. Estimated monthly cost on the major flagship tiers:
- Gemini 2.5 Flash-8B: 650M × $0.075 + 195M × $0.30 = ~$107/month
- GPT-4o-mini: 650M × $0.15 + 195M × $0.60 = ~$214/month
- Gemini 2.5 Flash: 650M × $0.30 + 195M × $2.50 = ~$683/month
- Claude Haiku 4.5: 650M × $0.80 + 195M × $4.00 = ~$1,300/month
- GPT-4o: 650M × $2.50 + 195M × $10 = ~$3,575/month
- Claude Sonnet 4.5: 650M × $3 + 195M × $15 = ~$4,875/month
- Claude Opus 4.6: 650M × $15 + 195M × $75 = ~$24,375/month
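The arithmetic above in one reusable snippet; the prices are the list prices from the table, and the volume constants come from the assumptions at the top of this example:

```python
INPUT_M, OUTPUT_M = 650, 195  # millions of tokens per month at 100k MAU


def monthly_cost(input_usd_per_1m: float, output_usd_per_1m: float) -> float:
    """Monthly bill in USD given list prices per 1M tokens."""
    return INPUT_M * input_usd_per_1m + OUTPUT_M * output_usd_per_1m


for name, in_price, out_price in [
    ("gemini-2.5-flash-8b", 0.075, 0.30),
    ("gpt-4o-mini", 0.15, 0.60),
    ("claude-sonnet-4-5", 3.00, 15.00),
    ("claude-opus-4-6", 15.00, 75.00),
]:
    print(f"{name:20s} ~${monthly_cost(in_price, out_price):,.0f}/month")
```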
That's a roughly 230x spread between cheapest and most expensive. Picking the right tier is usually worth more than any other optimization — and the right tier is almost never "the most expensive one".
Heuristic quality on coding, reasoning, creative
Quality is intentionally narrow here. We score on three task families with objective-ish answers: coding (does the generated code pass unit tests on a held-out set of 200 LeetCode-style problems plus 50 small refactor tasks), reasoning (multi-step word problems and a subset of GPQA), and creative (LLM-as-judge ranking of 100 short-form generations against Sonnet 4.5 as the implicit ceiling). Scores are 0-100, calibrated so 50 is "barely usable" and 90+ is "production-ready on this task family".
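A hedged sketch of the two scoring styles follows: pass-rate for coding, LLM-as-judge for creative. The judge prompt, judge model, and scaling are illustrative, not the exact evaluation harness:

```python
from openai import OpenAI

judge = OpenAI()  # any strong model behind an OpenAI-compatible endpoint


def coding_score(passed: list) -> float:
    """Coding: share of the 250 fixed problems (200 LeetCode-style + 50
    refactor tasks) whose generated code compiled and passed unit tests."""
    return 100.0 * sum(passed) / len(passed)


def creative_score(candidate: str, reference: str,
                   judge_model: str = "claude-sonnet-4-5") -> float:
    """Creative: LLM-as-judge rating against a reference ceiling."""
    resp = judge.chat.completions.create(
        model=judge_model,
        temperature=0,
        messages=[{
            "role": "user",
            "content": (
                "Score the CANDIDATE against the REFERENCE on a 0-100 scale. "
                "Reply with only the integer.\n\n"
                f"REFERENCE:\n{reference}\n\nCANDIDATE:\n{candidate}"
            ),
        }],
    )
    return float(resp.choices[0].message.content.strip())
```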
| Model (Provider) | Coding | Reasoning | Creative | Avg | Best at |
|---|---|---|---|---|---|
| claude-opus-4-6 (Anthropic) | 94 | 95 | 93 | 94.0 | Top-tier across the board |
| claude-sonnet-4-5 (Anthropic) | 92 | 90 | 91 | 91.0 | Coding, agentic tool use |
| o1 (OpenAI) | 88 | 96 | 78 | 87.3 | Hardest reasoning |
| gpt-4o (OpenAI) | 87 | 86 | 88 | 87.0 | Best generalist default |
| gemini-2.5-pro (Google) | 85 | 87 | 85 | 85.7 | 2M context, multimodal |
| grok-2 (xAI) | 82 | 84 | 86 | 84.0 | Current-events, X data |
| mistral-large-latest (Mistral) | 81 | 82 | 82 | 81.7 | EU compliance + quality |
| llama-3.3-70b (Meta, via Groq) | 79 | 80 | 78 | 79.0 | Open-weights, fast inference |
| command-r-plus (Cohere) | 76 | 75 | 79 | 76.7 | RAG, citation accuracy |
| gemini-2.5-flash (Google) | 74 | 75 | 76 | 75.0 | Cost-quality balance |
| gpt-4o-mini (OpenAI) | 73 | 72 | 76 | 73.7 | Best small-model default |
| claude-haiku-4-5 (Anthropic) | 72 | 71 | 75 | 72.7 | Fast Claude calls |
| jamba-1.5-large (AI21) | 68 | 70 | 71 | 69.7 | Long-context Mamba hybrid |
| codestral-latest (Mistral) | 88 | 62 | 55 | 68.3 | Fill-in-the-middle code |
| gemini-2.5-flash-8b (Google) | 62 | 63 | 66 | 63.7 | Cheapest usable tier |
Quality scores are relative, not absolute. They're useful for ranking models for your workload, not for marketing claims. Recalibrated quarterly.
Best provider for each common workload
Aggregating latency, cost, and quality, here are the picks we'd actually deploy in 2026 for the eight most common LLM workloads. "Best" assumes BYOK access on a normal-sized account — if you're at hyperscale (10B+ tokens/month), enterprise contract pricing changes the math.
General-purpose chatbot: GPT-4o-mini or Gemini 2.5 Flash
Best blend of $0.30-$1/1M, ~73-75 quality, sub-500ms TTFT. Both handle JSON / structured output reliably. Fall back to GPT-4o or Sonnet 4.5 only when you measurably need it.
Real-time / instant streaming: Cerebras or Groq (Llama 3.3 70B)
Practically the only choice when your UX demands that streaming responses feel instant. Quality at ~79/100 is good but below GPT-4o; pair with a slower-but-better model for non-realtime turns.
Hard reasoning and planning: OpenAI o1 or Claude Opus 4.6
For multi-step proofs, complex agentic planning, or PhD-level math. o1 is the math king (96 reasoning); Opus is more general. Both are slow + expensive — use sparingly.
Coding and structured code edits: Claude Sonnet 4.5
Wins coding (92/100), tool-use, and structured edits. Best at long-context refactors. For pure fill-in-the-middle, Codestral is cheaper at ~88 quality.
EU data residency and compliance: Mistral Large 2 or Cohere
Both have EU-hosted endpoints, GDPR-friendly DPAs, and EU-based support contacts. Mistral Large scores higher overall (81.7 vs 76.7); Cohere is slightly better at RAG citations.
Very long context: Gemini 2.5 Pro
The only mainstream model with a 2M context window. Useful for full codebases, long PDFs, multi-hour transcripts. Cost scales with input — pair with Flash-8B for cheap large-context.
High-volume, cost-sensitive tasks: Gemini 2.5 Flash-8B
$0.142 blended/1M is roughly 33x cheaper than GPT-4o ($4.75 blended) while still scoring 63.7 average. Good for high-volume classification, routing, simple summarization, and synthetic data generation.
Open-weights with a self-host fallback: Llama 3.3 70B via Together / Fireworks
If you need to be able to self-host as a fallback, or you have hard data-control constraints. Together and Fireworks deliver Llama at ~280-310ms TTFT for ~$0.80/1M blended.
8 surprising things from the 2026 data
The benchmark surfaced a few patterns that don't show up in vendor marketing pages. Some are practical, some are just interesting.
1. Cerebras and Groq aren't just "faster than GPT-4o" — they're 5-7x faster on TTFT (~120-150ms vs ~820ms). The gap is hardware-architectural (LPU/WSE vs GPU), not just engineering polish, so it won't close with normal optimization cycles.
2. From an EU-West client, Mistral Large is consistently faster than GPT-4o (410ms vs 820ms). From a US-East client, GPT-4o pulls ahead. If your users are in Europe, the cost-per-quality calculation flips for any provider with EU-resident infrastructure.
3. Calling Claude Sonnet via AWS Bedrock shows a 1.1s p50 vs 1.2s direct from Anthropic — Bedrock is actually slightly faster. But Azure-hosted GPT-4o is consistently slower (1.05s vs 820ms) than calling OpenAI directly. Hyperscaler hosting is no longer automatically faster.
4. Routing Claude Haiku via OpenRouter shows ~780ms p50 vs ~620ms calling Anthropic directly through VerticalAPI BYOK. That's 25% latency overhead on top of OpenRouter's ~5% token markup — meaningful for production traffic.
5. The gap between Claude Opus 4.6 (94 avg) and Claude Sonnet 4.5 (91 avg) is 3 quality points for a 5x cost difference. For most production workloads, Sonnet is the right default; Opus is only worth the price for the 5% of queries where the marginal quality matters.
6. Codestral's coding score (88) rivals Sonnet 4.5 — but its reasoning (62) and creative (55) scores are mediocre. It's a fill-in-the-middle specialist, not a coding agent. For coding agents you still want Sonnet or GPT-4o.
7. For ~90% of production prompts, GPT-4o-mini and Gemini 2.5 Flash produce output indistinguishable from their flagship cousins. The remaining 10% — complex reasoning, long-context refactors, edge-case generation — is where flagship-tier models earn their price. Auto-routing the easy 90% to the mini tier saves 5-10x on the bill (see the routing sketch after this list).
8. Median (p50) is reassuring, but p95 is what users feel. OpenAI's gpt-4o swings from 820ms p50 to 1.9s p95 — meaning 5% of requests are more than 2x slower than typical. For a streaming UI where users expect "instant", design around p95, not p50.
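A toy version of the routing idea from point 7. The markers, length threshold, and model choices are illustrative, not a production router:

```python
def pick_tier(prompt: str) -> str:
    """Route obviously-easy prompts to a mini-tier model; escalate the rest."""
    hard_markers = ("prove", "refactor", "step by step", "multi-file", "derive")
    looks_hard = len(prompt) > 4000 or any(m in prompt.lower() for m in hard_markers)
    return "gpt-4o" if looks_hard else "gpt-4o-mini"  # ~90% goes to the mini tier
```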
Run this benchmark on your own VerticalAPI account
Every VerticalAPI dashboard ships with a "Re-run benchmark" button that hits all 26 providers using your own keys and renders fresh p50/p95/cost charts. Because it's BYOK, you measure your account's actual rate limits, not ours. The harness below is also runnable standalone — point it at any OpenAI-compatible endpoint.
```python
import time
import statistics

from openai import OpenAI

# BYOK: each tuple is (provider, model, your own provider key)
PROVIDERS = [
    ("openai", "gpt-4o-mini", "sk-..."),
    ("anthropic", "claude-haiku-4-5", "sk-ant-..."),
    ("google", "gemini-2.5-flash", "AIza..."),
    ("groq", "llama-3.3-70b-versatile", "gsk_..."),
    # 22 more — see /docs/benchmark for full list
]

PROMPT = [{"role": "user", "content": "Summarize the BYOK pattern in 80 words."}]


def run(model, key, n=100):
    client = OpenAI(
        base_url="https://api.verticalapi.com/v1",
        api_key="vapi_...",
        default_headers={"X-Provider-Key": key},
    )
    ttfts = []
    for _ in range(n):
        t0 = time.time()
        stream = client.chat.completions.create(model=model, messages=PROMPT, stream=True)
        for chunk in stream:
            # time-to-first-token: stop the clock at the first streamed chunk
            ttfts.append((time.time() - t0) * 1000)
            break
        stream.close()  # discard the rest of the stream
    return {"p50": statistics.median(ttfts), "p95": statistics.quantiles(ttfts, n=20)[18]}


for name, model, key in PROVIDERS:
    print(name, run(model, key))
```
The full harness with all 26 providers, regional rotation, and quality eval lives in the dashboard at https://verticalapi.com/dashboard/benchmark.
Honest disclaimers about this benchmark
No public LLM benchmark is perfectly representative — every methodology trades something off. Here's what this one specifically doesn't capture.
- Account-specific rate limits. Numbers reflect a tier-3 OpenAI account, a Build-tier Anthropic account, and standard accounts elsewhere. Higher-tier accounts get faster median latency and lower 429 rates than the published numbers suggest.
- Time of day. Calls run hourly; we report a 7-day rolling median. Midweek-morning US-East traffic is faster than Friday-evening US-East. The dashboard shows the time-distribution if you want it.
- Quality is heuristic. The coding score reflects "compiles and passes unit tests on a fixed problem set" — not "writes good code". The reasoning score reflects "matches expected chain on a fixed problem set" — not "reasons well on novel problems". Quality is best used relatively (which model is better than which on this task family) rather than absolutely.
- Selection bias on prompt set. The 4-category prompt set was hand-curated to mimic production traffic — but production traffic varies by industry. RAG-heavy applications may see different rankings than agent-heavy applications.
- Vendor-specific features aren't measured. Anthropic's prompt caching, OpenAI's batch API, and Gemini's context caching all change the cost calculation dramatically for the right workloads. The headline numbers assume no caching.
- Network variance. Benchmarking inference latency is fundamentally noisy. We mitigate via large sample sizes (1000 calls/provider) and bootstrap confidence intervals, but a single bad-network day can move p95 by 20%.
The right way to use this benchmark is as a starting point for your own evaluation. Pick the 3 providers that look right for your workload, run them through your actual production prompts in your actual region, and compare with your actual rate limits. The "Re-run benchmark" button in the dashboard makes that one click.
Frequently asked questions
Which LLM provider is the fastest in 2026?
Cerebras leads on raw time-to-first-token (~120ms p50 on Llama 3.3 70B), followed closely by Groq (~150ms p50). Both run dedicated inference hardware (Cerebras WSE-3 wafer, Groq LPU) rather than commodity GPUs, which is why they outpace GPU-based providers like OpenAI (~820ms) and Anthropic (~1.2s). For tokens-per-second throughput, Groq wins at ~750 tok/s.
Which LLM provider is the cheapest per 1M tokens?
Gemini 2.5 Flash-8B is the cheapest mainstream option ($0.075 / $0.30 per 1M input/output tokens), followed by Mistral Ministral 8B and DeepInfra-hosted open-weights models. For high-quality flagship tier, Mistral Large 2 ($2 / $6) and Gemini 2.5 Pro ($1.25 / $10) offer the best price-to-quality ratios. OpenAI o1 ($15 / $60) and Claude Opus 4.6 ($15 / $75) are the most expensive flagship options.
How was this benchmark measured?
1000 chat-completion calls per provider, executed via the VerticalAPI gateway from EU-West (Paris) and US-East (Virginia). Each call uses one of four representative production payloads, from short chat (~80 input tokens) to long-context coding (~4500 input tokens). Latency p50/p95 are end-to-end time-to-first-token. Quality is a heuristic blend of LLM-as-judge scoring on coding/reasoning/creative prompts plus periodic human spot-checks. See the Methodology section above for the full procedure.
Why does the benchmark show different numbers than vendor marketing pages?
Vendor pages typically report best-case latency from a co-located region with an empty queue. This benchmark measures real-world conditions: cross-region calls, mixed traffic, full payloads. The gap between vendor-claimed and observed latency is usually 1.5-3x — a known finding in production LLM workloads. Reproducing the harness against your own keys is the only way to know what your account will actually see.
Can I run this benchmark myself?
Yes. The benchmark harness is one-click runnable from the VerticalAPI dashboard — it hits all 26 providers with your own API keys (BYOK) and renders fresh p50/p95 charts. The Python script in the Reproducibility section above is a minimal standalone version. Re-running the harness on your account is the only way to know what your rate limits, region, and traffic patterns will produce.
How often is this page updated?
The harness runs daily; the public page is refreshed weekly. Cost numbers are reviewed monthly against vendor pricing pages. Major model launches (e.g. a new GPT-5 or Claude 5) trigger an out-of-cycle re-run. The "last updated" timestamp in the hero header reflects the most recent refresh.
Why is OpenAI's o1 listed at 3.8s TTFT — is that representative?
Yes. o1-family reasoning models intentionally spend "thinking time" before producing a first token. The 3.8s p50 is normal for o1; production code that uses o1 should not stream incremental output the same way it would for GPT-4o. For latency-sensitive applications, o1-mini (~1.6s) is a closer drop-in.
Does VerticalAPI add latency to these numbers?
The VerticalAPI gateway adds ~5-10ms of routing overhead — measurable in synthetic tests but smaller than the natural variance of every provider downstream. All numbers above have the gateway overhead subtracted via a paired control measurement against direct provider endpoints.
Run this benchmark with your own keys
BYOK to all 26 providers. One dashboard. Re-run the benchmark on demand and see your real p50/p95.
Get started — Free →