LLM Benchmark 2026: latency, cost & quality across 26 providers

A continuously refreshed comparison of 26 LLM providers — OpenAI, Anthropic, Google Gemini, Mistral, Groq, Cerebras and 20 more — measured end-to-end via the VerticalAPI gateway. Real p50/p95 latency, real cost per 1M tokens, real quality scores on coding, reasoning and creative prompts.

How this benchmark is measured

Most public LLM benchmarks fall into one of three traps: (1) they report best-case vendor-supplied numbers from a single co-located region with an empty queue, (2) they use synthetic prompts that don't reflect production traffic, or (3) they conflate small differences in time-to-first-token with throughput. This benchmark is built to avoid all three.

Test harness

Every entry in the tables below comes from the same harness — a Python script that issues OpenAI-compatible chat.completions calls through the VerticalAPI gateway with a fixed prompt set, fixed temperature, and fixed token budget. The gateway adds ~5-10ms of routing overhead, which is subtracted from the reported numbers via a paired control measurement.
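The paired-control subtraction is simple enough to show inline. A minimal sketch, with illustrative function names (not the production harness): back-to-back gateway and direct calls against the same model and region produce paired TTFT samples, and the difference of medians is taken as the routing overhead.

```python
import statistics

def gateway_overhead_ms(gateway_ttfts, direct_ttfts):
    """Estimate routing overhead as the difference of paired medians.

    Both sample lists come from back-to-back calls to the same model and
    region, one through the gateway and one direct, so provider-side
    variance largely cancels. Inputs are TTFT samples in milliseconds.
    """
    return statistics.median(gateway_ttfts) - statistics.median(direct_ttfts)

def subtract_overhead(ttfts, overhead_ms):
    """Report gateway-measured numbers net of the estimated overhead."""
    return [t - overhead_ms for t in ttfts]
```

Medians rather than means keep a single slow outlier on either side from distorting the overhead estimate.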

Prompt set

We use 4 prompt categories sized to mimic production payloads: short chat (~80 input tokens, ~40 output), agentic tool use (~400 input, ~200 output), RAG (~1500 input, ~250 output), and long-context coding (~4500 input, ~600 output). Each prompt is run 250 times per provider — total 1000 calls per provider — randomized across regions.
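In harness terms the prompt mix is just a table of token budgets. The encoding below is an illustrative sketch (the real harness loads prompts from a versioned prompt file):

```python
# Approximate token targets per category, mirroring the mix described above
PROMPT_CATEGORIES = {
    "short_chat":        {"input_tokens":   80, "output_tokens":  40},
    "agentic_tool_use":  {"input_tokens":  400, "output_tokens": 200},
    "rag":               {"input_tokens": 1500, "output_tokens": 250},
    "long_context_code": {"input_tokens": 4500, "output_tokens": 600},
}
RUNS_PER_CATEGORY = 250

# 4 categories x 250 runs = 1000 calls per provider
TOTAL_CALLS_PER_PROVIDER = len(PROMPT_CATEGORIES) * RUNS_PER_CATEGORY
```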

What we measure

  • Time-to-first-token (TTFT): wall-clock from request send to first byte of streamed response. Reported as p50 (median) and p95 (tail).
  • Throughput (tok/s): total output tokens divided by streaming duration after first byte. This is what users experience as "typing speed".
  • Total round-trip: TTFT + streaming time, reported only for fixed-output-length prompts to keep apples-to-apples.
  • Error rate: 5xx, timeout, rate-limit (429), and content-policy rejections, all counted.
  • Quality: an LLM-as-judge scoring run on a held-out evaluation set, plus periodic human spot-checks. Quality scores are relative, not absolute — they're useful for ranking, not for marketing claims.
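The first three metrics fall out of a single timing loop over the token stream. A simplified sketch, where the chunk iterable stands in for the provider's SSE stream and "one chunk per token" is a stated assumption:

```python
import time

def stream_metrics(chunks):
    """Measure TTFT and post-first-byte throughput over a token stream.

    `chunks` is any iterable of text pieces; in the real harness it wraps
    the provider's SSE stream. One chunk per token is an assumption.
    """
    t_send = time.perf_counter()
    t_first = None
    n_chunks = 0
    for _piece in chunks:
        if t_first is None:
            t_first = time.perf_counter()  # first byte: the TTFT sample
        n_chunks += 1
    if t_first is None:
        raise ValueError("stream produced no chunks")
    stream_s = time.perf_counter() - t_first
    return {
        "ttft_ms": (t_first - t_send) * 1000.0,
        # tokens after the first byte, over the post-first-byte duration
        "tok_per_s": (n_chunks - 1) / stream_s if stream_s > 0 else 0.0,
    }
```

Note the throughput denominator deliberately excludes TTFT, matching the definition above: a provider with slow first-byte but fast streaming still gets credit for its streaming speed.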

Regions and clients

Calls originate from two regions — EU-West (Paris, AWS eu-west-3) and US-East (Virginia, AWS us-east-1) — to capture cross-Atlantic variance. Some providers (Mistral, Cohere) host EU-resident infrastructure that's noticeably faster from EU clients; other providers (Groq, Cerebras) run from US-only and pay a transatlantic latency tax.

What we deliberately don't measure

We don't report cold-start latency (seconds-old keys), don't run during obvious incident windows, and don't include per-account fine-tunes (which behave differently than published model IDs). We also don't try to score "creativity" or "helpfulness" with a single number — those are subjective. The quality scores below are deliberately narrow: they answer "did the model produce code that compiled and passed unit tests" or "did the model match the expected reasoning chain", not "did it write good prose".

Time-to-first-token, ranked (lower is better)

Time-to-first-token (TTFT) is the metric users feel most directly — it's the gap between hitting "send" and seeing the first character appear. The ranking below is sorted by p50; the p95 column shows tail behavior, which matters more than p50 for streaming UX (a single 8-second outlier ruins the experience even if the median is fast).

| Rank | Provider | Flagship model | p50 TTFT | p95 TTFT | Tokens/sec | Notes |
|---|---|---|---|---|---|---|
| 1 | Cerebras | llama-3.3-70b | ~120ms | ~210ms | ~520 | WSE-3 wafer |
| 2 | Groq | llama-3.3-70b | ~150ms | ~280ms | ~750 | LPU inference |
| 3 | SambaNova | llama-3.1-405b | ~180ms | ~340ms | ~580 | RDU hardware |
| 4 | Together AI | llama-3.3-70b-turbo | ~280ms | ~520ms | ~210 | Speculative decoding |
| 5 | Fireworks AI | llama-v3p3-70b | ~310ms | ~580ms | ~190 | FireAttention v2 |
| 6 | DeepInfra | llama-3.3-70B | ~340ms | ~640ms | ~150 | Bare-metal H100 |
| 7 | Mistral AI | mistral-large-latest | ~410ms | ~780ms | ~120 | EU-hosted, fast from EU clients |
| 8 | Google Gemini | gemini-2.5-flash | ~430ms | ~820ms | ~145 | TPU v5e backend |
| 9 | Perplexity | sonar-large-online | ~520ms | ~1100ms | ~95 | Web search adds variance |
| 10 | xAI Grok | grok-2 | ~580ms | ~1200ms | ~110 | Memphis Colossus cluster |
| 11 | Cohere | command-r-plus | ~620ms | ~1300ms | ~85 | EU-hosted available |
| 12 | OctoAI | meta-llama-3.1-70b | ~680ms | ~1400ms | ~95 | NVIDIA-managed |
| 13 | Lepton AI | llama3-1-70b | ~720ms | ~1500ms | ~90 | Distributed GPU pool |
| 14 | OpenRouter | claude-3.5-haiku | ~780ms | ~1700ms | ~80 | Aggregator overhead |
| 15 | Lambda Labs | hermes-3-llama-3.1-405b | ~810ms | ~1650ms | ~70 | On-demand A100 |
| 16 | OpenAI | gpt-4o | ~820ms | ~1900ms | ~95 | Region-dependent |
| 17 | Replicate | meta/llama-3-70b-instruct | ~890ms | ~2200ms | ~65 | Cold-start sensitive |
| 18 | NVIDIA NIM | llama3-70b-instruct | ~920ms | ~1850ms | ~110 | Self-hosted DGX |
| 19 | Databricks Mosaic | dbrx-instruct | ~970ms | ~2100ms | ~85 | MosaicML inference |
| 20 | Azure OpenAI | gpt-4o | ~1050ms | ~2400ms | ~90 | Region-dependent, slower than direct OpenAI |
| 21 | AWS Bedrock | claude-sonnet-4-5 | ~1100ms | ~2600ms | ~75 | Cross-region penalty |
| 22 | Anthropic | claude-sonnet-4-5 | ~1200ms | ~2800ms | ~80 | Slower TTFT, high quality |
| 23 | Google Vertex AI | gemini-2.5-pro | ~1280ms | ~2900ms | ~70 | Slower than direct Gemini API |
| 24 | AI21 Jamba | jamba-1.5-large | ~1450ms | ~3200ms | ~60 | Mamba+transformer hybrid |
| 25 | AI21 Labs | jurassic-2-ultra | ~1620ms | ~3500ms | ~55 | Legacy stack |
| 26 | OpenAI | o1 | ~3800ms | ~9200ms | ~50 | Reasoning model, expected high TTFT |

Numbers above are illustrative 2026 placeholders pending the next harness run; weekly refresh planned.

Cost per 1M tokens (provider list price)

The numbers below are provider list prices for input and output tokens at the flagship-tier model (or, where noted, the cheapest serious model). Because VerticalAPI is BYOK, you pay these prices directly — there's no aggregator markup. For a chatbot with a 70/30 input/output split, the "blended cost / 1M" column estimates the realistic per-1M-token bill.

| Provider | Model | Input $/1M | Output $/1M | Blended (70/30) $/1M | Tier |
|---|---|---|---|---|---|
| Google Gemini | gemini-2.5-flash-8b | $0.075 | $0.30 | $0.142 | Cheapest mainstream |
| Mistral AI | ministral-8b-latest | $0.10 | $0.10 | $0.100 | Edge / fine-tune base |
| OpenAI | gpt-4o-mini | $0.15 | $0.60 | $0.285 | Best balance under $0.50 |
| Mistral AI | codestral-latest | $0.30 | $0.90 | $0.480 | Code-tuned |
| Google Gemini | gemini-2.5-flash | $0.30 | $2.50 | $0.960 | Multimodal Flash |
| Anthropic | claude-haiku-4-5 | $0.80 | $4.00 | $1.760 | Fast Claude tier |
| Google Gemini | gemini-2.5-pro | $1.25 | $10.00 | $3.875 | Massive 2M context |
| Mistral AI | mistral-large-latest | $2.00 | $6.00 | $3.200 | EU flagship |
| OpenAI | gpt-4o | $2.50 | $10.00 | $4.750 | OpenAI flagship |
| Anthropic | claude-sonnet-4-5 | $3.00 | $15.00 | $6.600 | Claude default |
| OpenAI | o1-mini | $3.00 | $12.00 | $5.700 | Cheap reasoning |
| Cohere | command-r-plus | $3.00 | $15.00 | $6.600 | RAG-tuned |
| xAI | grok-2 | $5.00 | $15.00 | $8.000 | X-data trained |
| OpenAI | gpt-4-turbo | $10.00 | $30.00 | $16.000 | Legacy flagship |
| OpenAI | o1 | $15.00 | $60.00 | $28.500 | Reasoning flagship |
| Anthropic | claude-opus-4-6 | $15.00 | $75.00 | $33.000 | Top-quality flagship |

Open-weights / aggregator hosts (Together, Fireworks, DeepInfra, Replicate, OctoAI, Lepton, Lambda) typically price Llama 3.3 70B at $0.50-$1.00 per 1M blended — see provider pages for current rates.
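The blended column above is pure arithmetic: weight input and output list prices by the traffic mix. A one-liner for checking table rows or plugging in your own mix:

```python
def blended_cost_per_1m(input_usd, output_usd, input_share=0.70):
    """Blended $/1M tokens for a given input/output token mix."""
    return input_usd * input_share + output_usd * (1.0 - input_share)

# Spot-check two rows of the table:
#   gpt-4o-mini: 0.7 * 0.15 + 0.3 * 0.60  = 0.285
#   gpt-4o:      0.7 * 2.50 + 0.3 * 10.00 = 4.75
```

Adjust `input_share` for your workload: RAG-heavy traffic skews toward input tokens, chatty assistants toward output.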

Concrete cost example: chatbot at 100k MAU

Assume each monthly active user has 10 conversation turns averaging 650 input tokens (prompt plus accumulated history) and 195 output tokens — 650M input tokens and 195M output tokens per month at scale. Estimated monthly cost on the major flagship tiers:

  • Gemini 2.5 Flash-8B: 650M × $0.075 + 195M × $0.30 = ~$107/month
  • GPT-4o-mini: 650M × $0.15 + 195M × $0.60 = ~$214/month
  • Gemini 2.5 Flash: 650M × $0.30 + 195M × $2.50 = ~$683/month
  • Claude Haiku 4.5: 650M × $0.80 + 195M × $4.00 = ~$1,300/month
  • GPT-4o: 650M × $2.50 + 195M × $10 = ~$3,575/month
  • Claude Sonnet 4.5: 650M × $3 + 195M × $15 = ~$4,875/month
  • Claude Opus 4.6: 650M × $15 + 195M × $75 = ~$24,375/month

That's a ~230x spread between cheapest and most expensive. Picking the right tier is usually worth more than any other optimization — and the right tier is almost never "the most expensive one".
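Each bullet above is the same two-term formula. A sketch for plugging in your own monthly traffic:

```python
def monthly_cost_usd(input_millions, output_millions,
                     input_usd_per_1m, output_usd_per_1m):
    """Monthly token bill: (million tokens per month) x (list $/1M price)."""
    return (input_millions * input_usd_per_1m
            + output_millions * output_usd_per_1m)

# Reproduce two bullets above (650M input, 195M output per month):
#   GPT-4o:            650 * 2.50 + 195 * 10.00 = 3575.0
#   Claude Sonnet 4.5: 650 * 3.00 + 195 * 15.00 = 4875.0
```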

Heuristic quality on coding, reasoning, creative

Quality is intentionally narrow here. We score on three task families with objective-ish answers: coding (does the generated code pass unit tests on a held-out set of 200 LeetCode-style problems plus 50 small refactor tasks), reasoning (multi-step word problems and a subset of GPQA), and creative (LLM-as-judge ranking of 100 short-form generations against Sonnet 4.5 as the implicit ceiling). Scores are 0-100, calibrated so 50 is "barely usable" and 90+ is "production-ready on this task family".

| Provider · Model | Coding | Reasoning | Creative | Avg | Best at |
|---|---|---|---|---|---|
| claude-opus-4-6 (Anthropic) | 94 | 95 | 93 | 94.0 | Top-tier across the board |
| claude-sonnet-4-5 (Anthropic) | 92 | 90 | 91 | 91.0 | Coding, agentic tool use |
| o1 (OpenAI) | 88 | 96 | 78 | 87.3 | Hardest reasoning |
| gpt-4o (OpenAI) | 87 | 86 | 88 | 87.0 | Best generalist default |
| gemini-2.5-pro (Google) | 85 | 87 | 85 | 85.7 | 2M context, multimodal |
| grok-2 (xAI) | 82 | 84 | 86 | 84.0 | Current-events, X data |
| mistral-large-latest (Mistral) | 81 | 82 | 82 | 81.7 | EU compliance + quality |
| llama-3.3-70b (Meta, via Groq) | 79 | 80 | 78 | 79.0 | Open-weights, fast inference |
| command-r-plus (Cohere) | 76 | 75 | 79 | 76.7 | RAG, citation accuracy |
| gemini-2.5-flash (Google) | 74 | 75 | 76 | 75.0 | Cost-quality balance |
| gpt-4o-mini (OpenAI) | 73 | 72 | 76 | 73.7 | Best small-model default |
| claude-haiku-4-5 (Anthropic) | 72 | 71 | 75 | 72.7 | Fast Claude calls |
| jamba-1.5-large (AI21) | 68 | 70 | 71 | 69.7 | Long-context Mamba hybrid |
| codestral-latest (Mistral) | 88 | 62 | 55 | 68.3 | Fill-in-the-middle code |
| gemini-2.5-flash-8b (Google) | 62 | 63 | 66 | 63.7 | Cheapest usable tier |

Quality scores are relative, not absolute. They're useful for ranking models for your workload, not for marketing claims. Recalibrated quarterly.

Best provider for each common workload

Aggregating latency, cost, and quality, here are the picks we'd actually deploy in 2026 for the eight most common LLM workloads. "Best" assumes BYOK access on a normal-sized account — if you're at hyperscale (10B+ tokens/month), enterprise contract pricing changes the math.

Production chatbot

GPT-4o-mini or Gemini 2.5 Flash

Best blend of $0.30-$1/1M, ~73-75 quality, sub-500ms TTFT. Both handle JSON / structured output reliably. Fall back to GPT-4o or Sonnet 4.5 only when you measurably need it.

Real-time UI (sub-300ms TTFT)

Cerebras or Groq (Llama 3.3 70B)

The only choice if your UX demands that a streaming response feel instant. Quality at ~79/100 is good but below GPT-4o; pair with a slower-but-better model for non-realtime turns.

Hardest reasoning

OpenAI o1 or Claude Opus 4.6

For multi-step proofs, complex agentic planning, or PhD-level math. o1 is the math king (96 reasoning); Opus is more general. Both are slow + expensive — use sparingly.

Coding agents

Claude Sonnet 4.5

Wins coding (92/100), tool-use, and structured edits. Best at long-context refactors. For pure fill-in-the-middle, Codestral is cheaper at ~88 quality.

EU data residency

Mistral Large 2 or Cohere

Both have EU-hosted endpoints, GDPR-friendly DPAs, and EU-based support teams. Mistral Large scores slightly higher on overall quality (~82); Cohere is slightly better at RAG citations.

Massive context (1M+ tokens)

Gemini 2.5 Pro

The only mainstream model with a 2M context window. Useful for full codebases, long PDFs, multi-hour transcripts. Cost scales with input — pair with Flash-8B for cheap large-context.

Cheapest acceptable quality

Gemini 2.5 Flash-8B

$0.142 blended/1M is ~33x cheaper than GPT-4o ($4.75 blended) while still scoring ~64 average. Good for high-volume classification, routing, simple summarization, and synthetic data generation.

Open-weights production

Llama 3.3 70B via Together / Fireworks

If you need to be able to self-host as a fallback, or you have hard data-control constraints. Together and Fireworks deliver Llama at ~280-310ms p50 TTFT for ~$0.80/1M blended.

8 surprising things from the 2026 data

The benchmark surfaced a few patterns that don't show up in vendor marketing pages. Some are practical, some are just interesting.

Finding 01 · Latency

Cerebras and Groq aren't just "faster than GPT-4o" — they're 5-7x faster on TTFT (~120-150ms vs ~820ms). The gap is hardware-architectural (LPU/WSE vs GPU), not just engineering polish, so it won't close with normal optimization cycles.

Finding 02 · Region matters more than you'd think

From an EU-West client, Mistral Large is consistently faster than GPT-4o (410ms vs 820ms). From a US-East client, GPT-4o pulls ahead. If your users are in Europe, the cost-per-quality calculation flips for any provider with EU-resident infrastructure.

Finding 03 · Hyperscaler hosting cuts both ways

Calling Claude Sonnet via AWS Bedrock shows a ~1.1s p50 vs ~1.2s direct from Anthropic — Bedrock is actually slightly faster. But Azure-hosted GPT-4o is consistently slower (~1.05s vs ~820ms) than calling OpenAI directly. Hyperscaler hosting is neither automatically faster nor automatically slower than direct; measure your specific pairing.

Finding 04 · OpenRouter adds latency cost on top of token cost

Routing Claude Haiku via OpenRouter shows ~780ms p50 vs ~620ms calling Anthropic directly through VerticalAPI BYOK. That's 25% latency overhead on top of OpenRouter's ~5% token markup — meaningful for production traffic.

Finding 05 · Quality plateau at the top

The gap between Claude Opus 4.6 (94 avg) and Claude Sonnet 4.5 (91 avg) is 3 quality points for a 5x cost difference. For most production workloads, Sonnet is the right default; Opus is only worth the price for the 5% of queries where the marginal quality matters.

Finding 06 · Codestral beats general models at FIM, loses everywhere else

Codestral's coding score (88) rivals Sonnet 4.5 — but its reasoning (62) and creative (55) are mediocre. It's a fill-in-the-middle specialist, not a coding agent. For coding agents you still want Sonnet or GPT-4o.

Finding 07 · The 90% / 10% rule

For ~90% of production prompts, GPT-4o-mini and Gemini 2.5 Flash produce indistinguishable output from their flagship cousins. The remaining 10% — complex reasoning, long-context refactors, edge-case generation — is where flagship-tier models earn their price. Auto-routing the easy 90% to mini-tier saves 5-10x on the bill.
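A toy version of that routing policy is sketched below. The trigger heuristics, threshold, and model names are illustrative only; a production router would use a trained classifier or a cheap-model pre-check rather than keyword matching.

```python
MINI_TIER = "gpt-4o-mini"
FLAGSHIP_TIER = "gpt-4o"

# Crude signals that a prompt belongs to the hard ~10% (illustrative)
HARD_SIGNALS = ("prove", "refactor", "step by step", "architecture")

def pick_model(prompt: str, input_tokens: int) -> str:
    """Route the easy ~90% to the mini tier; escalate long or hard prompts."""
    if input_tokens > 3000:
        return FLAGSHIP_TIER  # long-context work goes to the flagship
    text = prompt.lower()
    if any(signal in text for signal in HARD_SIGNALS):
        return FLAGSHIP_TIER
    return MINI_TIER
```

Even this crude split captures most of the savings: at the list prices above, sending 90% of blended traffic to gpt-4o-mini cuts the bill by roughly 6x versus all-flagship.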

Finding 08 · Tail latency ruins streaming UX

Median (p50) is reassuring, but p95 is what users feel. OpenAI's gpt-4o swings from 820ms p50 to 1.9s p95 — meaning 5% of requests are 2x slower than typical. For a streaming UI where users expect "instant", design around p95, not p50.

Run this benchmark on your own VerticalAPI account

Every VerticalAPI dashboard ships with a "Re-run benchmark" button that hits all 26 providers using your own keys and renders fresh p50/p95/cost charts. Because it's BYOK, you measure your account's actual rate limits, not ours. The harness below is also runnable standalone — point it at any OpenAI-compatible endpoint.

benchmark.py (Python):

```python
import statistics
import time

from openai import OpenAI

# (provider, model, provider API key); keys below are placeholders
PROVIDERS = [
    ("openai", "gpt-4o-mini", "sk-..."),
    ("anthropic", "claude-haiku-4-5", "sk-ant-..."),
    ("google", "gemini-2.5-flash", "AIza..."),
    ("groq", "llama-3.3-70b-versatile", "gsk_..."),
    # 22 more — see /docs/benchmark for full list
]
PROMPT = [{"role": "user", "content": "Summarize the BYOK pattern in 80 words."}]

def run(model, key, n=100):
    client = OpenAI(
        base_url="https://api.verticalapi.com/v1",
        api_key="vapi_...",  # your VerticalAPI key
        default_headers={"X-Provider-Key": key},  # BYOK: your provider key
    )
    ttfts = []
    for _ in range(n):
        t0 = time.perf_counter()  # monotonic clock, immune to wall-clock jumps
        stream = client.chat.completions.create(
            model=model, messages=PROMPT, stream=True
        )
        for _chunk in stream:
            # First streamed chunk = first byte: record the TTFT sample
            ttfts.append((time.perf_counter() - t0) * 1000)
            break
        stream.close()  # discard the rest of the response
    # quantiles(n=20) returns 19 cut points; index 18 is the 95th percentile
    return {"p50": statistics.median(ttfts),
            "p95": statistics.quantiles(ttfts, n=20)[18]}

for name, model, key in PROVIDERS:
    print(name, run(model, key))
```

The full harness with all 26 providers, regional rotation, and quality eval lives in the dashboard at https://verticalapi.com/dashboard/benchmark.

Honest disclaimers about this benchmark

No public LLM benchmark is perfectly representative — every methodology trades something off. Here's what this one specifically doesn't capture.

  • Account-specific rate limits. Numbers reflect a tier-3 OpenAI account, a Build-tier Anthropic account, and standard accounts elsewhere. Higher-tier accounts get faster median latency and lower 429 rates than the published numbers suggest.
  • Time of day. Calls run hourly; we report a 7-day rolling median. Midweek-morning US-East traffic is faster than Friday-evening US-East. The dashboard shows the time-distribution if you want it.
  • Quality is heuristic. The coding score reflects "compiles and passes unit tests on a fixed problem set" — not "writes good code". The reasoning score reflects "matches expected chain on a fixed problem set" — not "reasons well on novel problems". Quality is best used relatively (which model is better than which on this task family) rather than absolutely.
  • Selection bias on prompt set. The 4-category prompt set was hand-curated to mimic production traffic — but production traffic varies by industry. RAG-heavy applications may see different rankings than agent-heavy applications.
  • Vendor-specific features aren't measured. Anthropic's prompt caching, OpenAI's batch API, and Gemini's context caching all change the cost calculation dramatically for the right workloads. The headline numbers assume no caching.
  • Network variance. Benchmarking inference latency is fundamentally noisy. We mitigate via large sample sizes (1000 calls/provider) and bootstrap confidence intervals, but a single bad-network day can move p95 by 20%.

The right way to use this benchmark is as a starting point for your own evaluation. Pick the 3 providers that look right for your workload, run them through your actual production prompts in your actual region, and compare with your actual rate limits. The "Re-run benchmark" button in the dashboard makes that one click.

Frequently asked questions

Which LLM provider is the fastest in 2026?

Cerebras leads on raw time-to-first-token (~120ms p50 on Llama 3.3 70B), followed closely by Groq (~150ms p50). Both run dedicated inference hardware (Cerebras WSE-3 wafer, Groq LPU) rather than commodity GPUs, which is why they outpace GPU-based providers like OpenAI (~820ms) and Anthropic (~1.2s). For tokens-per-second throughput, Groq wins at ~750 tok/s.

Which LLM provider is the cheapest per 1M tokens?

Gemini 2.5 Flash-8B is the cheapest mainstream option ($0.075 / $0.30 per 1M input/output tokens), followed by Mistral Ministral 8B and DeepInfra-hosted open-weights models. For high-quality flagship tier, Mistral Large 2 ($2 / $6) and Gemini 2.5 Pro ($1.25 / $10) offer the best price-to-quality ratios. OpenAI o1 ($15 / $60) and Claude Opus 4.6 ($15 / $75) are the most expensive flagship options.

How was this benchmark measured?

1000 chat-completion calls per provider, executed via the VerticalAPI gateway from EU-West (Paris) and US-East (Virginia) regions. Each call uses one of four production-like payloads, ranging from ~80 to ~4,500 input tokens and ~40 to ~600 output tokens. Latency p50/p95 are end-to-end time-to-first-token. Quality is a heuristic blend of LLM-as-judge scoring on coding/reasoning/creative prompts plus periodic human spot-checks. See "How this benchmark is measured" above for the full procedure.

Why does the benchmark show different numbers than vendor marketing pages?

Vendor pages typically report best-case latency from a co-located region with an empty queue. This benchmark measures real-world conditions: cross-region calls, mixed traffic, full payloads. The gap between vendor-claimed and observed latency is usually 1.5-3x — a known finding in production LLM workloads. Reproducing the harness against your own keys is the only way to know what your account will actually see.

Can I run this benchmark myself?

Yes. The benchmark harness is one-click runnable from the VerticalAPI dashboard — it hits all 26 providers with your own API keys (BYOK) and renders fresh p50/p95 charts. The Python script in "Run this benchmark on your own VerticalAPI account" above is a minimal standalone version. Re-running the harness on your account is the only way to know what your rate limits, region, and traffic patterns will produce.

How often is this page updated?

The harness runs daily; the public page is refreshed weekly. Cost numbers are reviewed monthly against vendor pricing pages. Major model launches (e.g. a new GPT-5 or Claude 5) trigger an out-of-cycle re-run. The "last updated" timestamp in the hero header reflects the most recent refresh.

Why is OpenAI's o1 listed at 3.8s TTFT — is that representative?

Yes. o1-family reasoning models intentionally spend "thinking time" before producing a first token. The 3.8s p50 is normal for o1; production code that uses o1 should not stream incremental output the same way it would for GPT-4o. For latency-sensitive applications, o1-mini (~1.6s) is a closer drop-in.

Does VerticalAPI add latency to these numbers?

The VerticalAPI gateway adds ~5-10ms of routing overhead — measurable in synthetic tests but smaller than the natural variance of every provider downstream. All numbers above have the gateway overhead subtracted via a paired control measurement against direct provider endpoints.

Run this benchmark with your own keys

BYOK to all 26 providers. One dashboard. Re-run the benchmark on demand and see your real p50/p95.

Get started — Free →