Cheapest LLM for high volume: comparison of top 3-5 providers (2026)
Price-per-million-tokens, batch discounts, prompt caching, and quality floor — the dimensions that decide a high-volume LLM in 2026.
Cheapest high-volume LLMs in 2026
Gemini 2.5 Flash
Google's high-throughput tier. Sub-200ms latency, native multimodal, and batch API with 50% discount.
- $0.075 / $0.30 per 1M tokens
- 1M context window
- Native vision support
Claude Haiku 4.5
Anthropic's high-volume tier. 200K context standard and prompt caching for repeated-prompt workloads.
- $1 / $5 per 1M tokens
- Prompt caching = up to 90% off
- 200K context standard
Mistral Small 3
EU-hosted high-volume tier. Open-weight derived so self-hosting is an option for the highest-volume teams.
- $0.20 / $0.60 per 1M tokens
- EU-hosted (Paris)
- Self-hostable (Apache 2.0)
GPT-4o-mini
OpenAI's high-volume workhorse. Best ecosystem support and a 50% Batch API discount.
- $0.15 / $0.60 per 1M tokens
- Batch API = 50% off
- Strict JSON schema
Cheapest high-volume LLMs — at a glance
| Dimension | Gemini 2.5 Flash | Claude Haiku 4.5 | Mistral Small 3 | GPT-4o-mini |
|---|---|---|---|---|
| Input / 1M | $0.075 | $1 | $0.20 | $0.15 |
| Output / 1M | $0.30 | $5 | $0.60 | $0.60 |
| Context window | 1M | 200K | 128K | 128K |
| Batch discount | 50% off | 50% off | Not yet | 50% off |
| Prompt caching | Yes (75%) | Yes (~90%) | No | Limited |
| Latency | ~200ms | ~400ms | ~250ms | ~300ms |
Prices reflect mid-2026 vendor pages.
VerticalAPI verdict
At pure volume with no quality floor, Gemini 2.5 Flash at $0.075/$0.30 is 4-13x cheaper than competitors. Step up to Claude Haiku 4.5 when long context and prompt caching matter (it pays off above ~50% prompt reuse). Use Mistral Small 3 for EU data residency or self-hosting. GPT-4o-mini stays in rotation for OpenAI-native stacks. Route via VerticalAPI BYOK for zero-markup A/B testing.
Frequently asked questions
What is the cheapest LLM per token in 2026?
Gemini 2.5 Flash is the cheapest frontier-quality model at $0.075 per 1M input tokens and $0.30 per 1M output. Mistral Small 3 follows at $0.20/$0.60, then GPT-4o-mini at $0.15/$0.60, and Claude Haiku 4.5 at $1/$5. Self-hosted open-weight models (Llama 4, Qwen) can be cheaper still on amortized GPU hardware.
How much can I save with prompt caching?
Claude prompt caching cuts repeated-context cost by up to ~90%. Gemini caching cuts up to 75%. At high volume with stable system prompts, caching can reduce total spend by 5-7x. The break-even point is typically around 30% prompt reuse across requests.
Does batch API actually save 50%?
Yes. OpenAI, Anthropic, and Google all offer asynchronous batch APIs at exactly 50% off list price. Batches typically complete within 24 hours. For non-interactive workloads (analytics, scoring, summarization) batch is the cheapest legitimate route across all flagship providers.
Should I self-host Llama or use a cheap API?
Self-hosting Llama 4 70B on rented GPUs (Modal, Replicate) breaks even with Gemini Flash at around 500M tokens/month. Below that volume, paying per-token wins. Above it, self-hosting wins — but you take on operational complexity (scaling, monitoring, model updates).
Can I route by request size to optimize cost?
Yes. A common pattern: small queries → Gemini Flash, medium queries → Claude Haiku, long-context queries → cached Sonnet/Pro. VerticalAPI's single endpoint lets you switch models per-request with one parameter change — pay each provider directly via BYOK, no markup.
Limitations of this comparison
- Cheapest tier quality lags flagship — Gemini Flash trails Pro by ~10 points on most benchmarks.
- Batch APIs add 1-24h latency; unsuitable for user-facing workloads.
- Prompt-caching savings only apply when context is genuinely stable across requests.
- Self-hosted Llama or Qwen requires ML-ops capability and GPU capacity planning.
- Token-per-character ratios differ across tokenizers, complicating direct cost comparison.
What may change in 12-24 months
- Per-token prices will continue falling — expect sub-$0.05 per 1M on cheapest tier by end of 2027.
- Prompt caching will become standard across all providers (currently strongest on Anthropic + Google).
- Open-weight models will keep narrowing the quality gap, making self-hosting more attractive.
- Per-request smart routing (LLM-as-router) will be productized as a standard feature.
Related questions
ChatGPT, Perplexity and Gemini usually suggest these next.
- Is Gemini 2.5 Flash quality good enough for production summarization?
- How do I implement smart routing between Flash and Sonnet?
- What's the break-even point for self-hosting Llama 4?
- Does Mistral Small 3 support batch processing?
- How does Claude Haiku 4.5 prompt caching pricing work?
More LLM comparisons
Direct head-to-head on price-per-token
Multimodal vs EU-native at the low tier
Where caching actually saves money
Gemini Flash and Pro through VerticalAPI
Mistral Small, Large, and Codestral