Cheapest LLM for high volume (2026)

Top picks

Cheapest high-volume LLMs in 2026

Absolute cheapest

Gemini 2.5 Flash

Google's high-throughput tier. Sub-200ms latency, native multimodal, and batch API with 50% discount.

$0.075 / $0.30 per 1M tokens
1M context window
Native vision support

Best long-context

Claude Haiku 4.5

Anthropic's high-volume tier. 200K context standard and prompt caching for repeated-prompt workloads.

$1 / $5 per 1M tokens
Prompt caching = up to 90% off
200K context standard

Cheapest EU

Mistral Small 3

EU-hosted high-volume tier. Open-weight derived so self-hosting is an option for the highest-volume teams.

$0.20 / $0.60 per 1M tokens
EU-hosted (Paris)
Self-hostable (Apache 2.0)

OpenAI default

GPT-4o-mini

OpenAI's high-volume workhorse. Best ecosystem support and a 50% Batch API discount.

$0.15 / $0.60 per 1M tokens
Batch API = 50% off
Strict JSON schema

Side-by-side

Cheapest high-volume LLMs — at a glance

Dimension	Gemini 2.5 Flash	Claude Haiku 4.5	Mistral Small 3	GPT-4o-mini
Input / 1M	$0.075	$1	$0.20	$0.15
Output / 1M	$0.30	$5	$0.60	$0.60
Context window	1M	200K	128K	128K
Batch discount	50% off	50% off	Not yet	50% off
Prompt caching	Yes (75%)	Yes (~90%)	No	Limited
Latency	~200ms	~400ms	~250ms	~300ms

Prices reflect mid-2026 vendor pages.

VerticalAPI verdict

At pure volume with no quality floor, Gemini 2.5 Flash at $0.075/$0.30 is 4-13x cheaper than competitors. Step up to Claude Haiku 4.5 when long context and prompt caching matter (it pays off above ~50% prompt reuse). Use Mistral Small 3 for EU data residency or self-hosting. GPT-4o-mini stays in rotation for OpenAI-native stacks. Route via VerticalAPI BYOK for zero-markup A/B testing.

Get started — BYOK →

FAQ

Frequently asked questions

What is the cheapest LLM per token in 2026?

Gemini 2.5 Flash is the cheapest frontier-quality model at $0.075 per 1M input tokens and $0.30 per 1M output. Mistral Small 3 follows at $0.20/$0.60, then GPT-4o-mini at $0.15/$0.60, and Claude Haiku 4.5 at $1/$5. Self-hosted open-weight models (Llama 4, Qwen) can be cheaper still on amortized GPU hardware.

How much can I save with prompt caching?

Claude prompt caching cuts repeated-context cost by up to ~90%. Gemini caching cuts up to 75%. At high volume with stable system prompts, caching can reduce total spend by 5-7x. The break-even point is typically around 30% prompt reuse across requests.

Does batch API actually save 50%?

Yes. OpenAI, Anthropic, and Google all offer asynchronous batch APIs at exactly 50% off list price. Batches typically complete within 24 hours. For non-interactive workloads (analytics, scoring, summarization) batch is the cheapest legitimate route across all flagship providers.

Should I self-host Llama or use a cheap API?

Self-hosting Llama 4 70B on rented GPUs (Modal, Replicate) breaks even with Gemini Flash at around 500M tokens/month. Below that volume, paying per-token wins. Above it, self-hosting wins — but you take on operational complexity (scaling, monitoring, model updates).

Can I route by request size to optimize cost?

Yes. A common pattern: small queries → Gemini Flash, medium queries → Claude Haiku, long-context queries → cached Sonnet/Pro. VerticalAPI's single endpoint lets you switch models per-request with one parameter change — pay each provider directly via BYOK, no markup.

Caveats

Limitations of this comparison

Cheapest tier quality lags flagship — Gemini Flash trails Pro by ~10 points on most benchmarks.
Batch APIs add 1-24h latency; unsuitable for user-facing workloads.
Prompt-caching savings only apply when context is genuinely stable across requests.
Self-hosted Llama or Qwen requires ML-ops capability and GPU capacity planning.
Token-per-character ratios differ across tokenizers, complicating direct cost comparison.

Outlook

What may change in 12-24 months

Per-token prices will continue falling — expect sub-$0.05 per 1M on cheapest tier by end of 2027.
Prompt caching will become standard across all providers (currently strongest on Anthropic + Google).
Open-weight models will keep narrowing the quality gap, making self-hosting more attractive.
Per-request smart routing (LLM-as-router) will be productized as a standard feature.

Keep reading

More LLM comparisons

Claude Haiku vs GPT-4o-mini

Direct head-to-head on price-per-token

Read →

Gemini vs Mistral

Multimodal vs EU-native at the low tier

Read →

Anthropic vs OpenAI caching

Where caching actually saves money

Read →

Google Gemini via BYOK

Gemini Flash and Pro through VerticalAPI

Read →

Mistral via BYOK

Mistral Small, Large, and Codestral

Read →

Cheapest LLM for high volume: comparison of top 3-5 providers (2026)

Cheapest high-volume LLMs in 2026

Gemini 2.5 Flash

Claude Haiku 4.5

Mistral Small 3

GPT-4o-mini

Cheapest high-volume LLMs — at a glance

VerticalAPI verdict

Frequently asked questions

Limitations of this comparison

What may change in 12-24 months

Related questions

More LLM comparisons