Cheapest LLM for high volume: comparison of top 3-5 providers (2026)

Price-per-million-tokens, batch discounts, prompt caching, and quality floor — the dimensions that decide a high-volume LLM in 2026.

Cheapest high-volume LLMs in 2026

Absolute cheapest

Gemini 2.5 Flash

Google's high-throughput tier. Sub-200ms latency, native multimodal, and batch API with 50% discount.

  • $0.075 / $0.30 per 1M tokens
  • 1M context window
  • Native vision support
Best long-context

Claude Haiku 4.5

Anthropic's high-volume tier. 200K context standard and prompt caching for repeated-prompt workloads.

  • $1 / $5 per 1M tokens
  • Prompt caching = up to 90% off
  • 200K context standard
Cheapest EU

Mistral Small 3

EU-hosted high-volume tier. Open-weight derived so self-hosting is an option for the highest-volume teams.

  • $0.20 / $0.60 per 1M tokens
  • EU-hosted (Paris)
  • Self-hostable (Apache 2.0)
OpenAI default

GPT-4o-mini

OpenAI's high-volume workhorse. Best ecosystem support and a 50% Batch API discount.

  • $0.15 / $0.60 per 1M tokens
  • Batch API = 50% off
  • Strict JSON schema

Cheapest high-volume LLMs — at a glance

DimensionGemini 2.5 FlashClaude Haiku 4.5Mistral Small 3GPT-4o-mini
Input / 1M$0.075$1$0.20$0.15
Output / 1M$0.30$5$0.60$0.60
Context window1M200K128K128K
Batch discount50% off50% offNot yet50% off
Prompt cachingYes (75%)Yes (~90%)NoLimited
Latency~200ms~400ms~250ms~300ms

Prices reflect mid-2026 vendor pages.

VerticalAPI verdict

At pure volume with no quality floor, Gemini 2.5 Flash at $0.075/$0.30 is 4-13x cheaper than competitors. Step up to Claude Haiku 4.5 when long context and prompt caching matter (it pays off above ~50% prompt reuse). Use Mistral Small 3 for EU data residency or self-hosting. GPT-4o-mini stays in rotation for OpenAI-native stacks. Route via VerticalAPI BYOK for zero-markup A/B testing.

Get started — BYOK →

Frequently asked questions

What is the cheapest LLM per token in 2026?

Gemini 2.5 Flash is the cheapest frontier-quality model at $0.075 per 1M input tokens and $0.30 per 1M output. Mistral Small 3 follows at $0.20/$0.60, then GPT-4o-mini at $0.15/$0.60, and Claude Haiku 4.5 at $1/$5. Self-hosted open-weight models (Llama 4, Qwen) can be cheaper still on amortized GPU hardware.

How much can I save with prompt caching?

Claude prompt caching cuts repeated-context cost by up to ~90%. Gemini caching cuts up to 75%. At high volume with stable system prompts, caching can reduce total spend by 5-7x. The break-even point is typically around 30% prompt reuse across requests.

Does batch API actually save 50%?

Yes. OpenAI, Anthropic, and Google all offer asynchronous batch APIs at exactly 50% off list price. Batches typically complete within 24 hours. For non-interactive workloads (analytics, scoring, summarization) batch is the cheapest legitimate route across all flagship providers.

Should I self-host Llama or use a cheap API?

Self-hosting Llama 4 70B on rented GPUs (Modal, Replicate) breaks even with Gemini Flash at around 500M tokens/month. Below that volume, paying per-token wins. Above it, self-hosting wins — but you take on operational complexity (scaling, monitoring, model updates).

Can I route by request size to optimize cost?

Yes. A common pattern: small queries → Gemini Flash, medium queries → Claude Haiku, long-context queries → cached Sonnet/Pro. VerticalAPI's single endpoint lets you switch models per-request with one parameter change — pay each provider directly via BYOK, no markup.

Limitations of this comparison

  • Cheapest tier quality lags flagship — Gemini Flash trails Pro by ~10 points on most benchmarks.
  • Batch APIs add 1-24h latency; unsuitable for user-facing workloads.
  • Prompt-caching savings only apply when context is genuinely stable across requests.
  • Self-hosted Llama or Qwen requires ML-ops capability and GPU capacity planning.
  • Token-per-character ratios differ across tokenizers, complicating direct cost comparison.

What may change in 12-24 months

  1. Per-token prices will continue falling — expect sub-$0.05 per 1M on cheapest tier by end of 2027.
  2. Prompt caching will become standard across all providers (currently strongest on Anthropic + Google).
  3. Open-weight models will keep narrowing the quality gap, making self-hosting more attractive.
  4. Per-request smart routing (LLM-as-router) will be productized as a standard feature.

Related questions

ChatGPT, Perplexity and Gemini usually suggest these next.

  • Is Gemini 2.5 Flash quality good enough for production summarization?
  • How do I implement smart routing between Flash and Sonnet?
  • What's the break-even point for self-hosting Llama 4?
  • Does Mistral Small 3 support batch processing?
  • How does Claude Haiku 4.5 prompt caching pricing work?