LLM rate limits compared (2026)
OpenAI TPM tiers, Anthropic spend-based scaling, Google Gemini quotas. Production teams running into 429 errors usually solve it with fallback routing — not more tier upgrades.
OpenAI, Anthropic, Google — typical limits
| Tier | OpenAI (GPT-4o) | Anthropic (Claude Sonnet 4.5) | Google (Gemini 2.5 Pro) |
|---|---|---|---|
| Free / Starter | 30K TPM, 3 RPM | 50K TPM, 50 RPM | Free AI Studio: 32K daily |
| Tier 1 ($5-50) | 30K TPM, 500 RPM | 80K TPM, 1K RPM | Vertex Pay-as-you-go |
| Tier 3 ($100+) | 800K TPM, 5K RPM | 400K TPM, 4K RPM | Up to 2M TPM |
| Tier 5 ($1K+) | 30M TPM, 30K RPM | 2M+ TPM (custom) | 10M+ TPM (custom) |
| Enterprise ($5K+/month) | Custom (negotiated) | Custom | Vertex committed capacity |
| Batch API quota | Separate (10x sync) | Separate batch queue | Vertex Batch separate |
TPM = tokens per minute, RPM = requests per minute. Limits change frequently; check provider dashboards.
VerticalAPI verdict
Rate-limit upgrades scale with spend, not negotiation. The fastest way to add headroom isn't asking sales — it's stacking provider quotas via BYOK fallback. Route GPT-4o first, fall back to Claude on 429, and to Gemini after that. Each provider sees its own per-account quota, so you effectively triple your real-time budget without contract changes. Plus, batch APIs run on separate queues and give 50% off — push background work there.
Frequently asked questions
What are typical LLM rate limits in 2026?
Rate limits in 2026 are tiered by provider and account spend. OpenAI starts free-tier accounts at roughly 30,000 tokens-per-minute (TPM) and scales through Tiers 1-5 up to 30M+ TPM for enterprise. Anthropic starts new accounts around 50K TPM and 50 RPM, scaling with monthly spend. Google Gemini gives free Studio access with low daily quotas (around 32K daily on free), scaling to high TPM on paid Vertex AI. Enterprise contracts above $5K/month typically unlock 1M+ TPM across all providers.
How does OpenAI's TPM tier system work?
OpenAI rate limits are organized into Usage Tiers (Free, Tier 1-5). Free starts at 30K TPM and 3 RPM on GPT-4o. Tier 1 ($5+ spent, 7+ days) reaches 30K TPM and 500 RPM. Tier 3 ($100+ spent, 7+ days) reaches 800K TPM and 5,000 RPM. Tier 5 ($1,000+ spent, 30+ days) gives 30M TPM and 30,000 RPM on GPT-4o. Limits are per model and per organization. Token-per-minute counts both input and output. Hitting limits returns HTTP 429; OpenAI publishes the current limits in the dashboard.
Why do production teams hit rate limits even on paid tiers?
Three common causes: bursty traffic (e.g. nightly batch jobs that exceed minute-level TPM), long-context requests (a single 1M-token Gemini call uses an enormous TPM slice), and concurrent agent runs (parallel tool calls multiply RPM). Solutions include staggered scheduling, batch APIs (which use a separate quota), exponential backoff with jitter, and multi-provider fallback. BYOK gateways like VerticalAPI let you fall over from one provider to another on 429 within the same request shape, with no SDK changes.
Do batch APIs have separate rate limits?
Yes. OpenAI Batch API uses a separate batch queue with much higher daily limits (typically 10x the synchronous TPM) at the cost of up to 24-hour latency. Anthropic Message Batches similarly run on a dedicated quota. Google Vertex Batch is also separate from interactive limits. For background workloads (RAG indexing, classification, evaluation runs), routing to batch APIs both saves 50% on tokens and avoids competing with real-time traffic for synchronous quota.
Can a BYOK gateway help me avoid 429 errors?
Yes. VerticalAPI's OpenAI-compatible endpoint at https://api.verticalapi.com/v1 lets you configure fallback chains — for example, route to GPT-4o first, fall back to Claude Sonnet 4.5 on 429 or 5xx, then Gemini. Because it's BYOK, each fallback uses your own keys with each provider, so you stack their rate-limit budgets rather than competing for a shared pool. This is the cheapest way to add resilience without renegotiating enterprise contracts or paying per-token markup.
Limitations of this comparison
- Rate-limit tiers change without notice; OpenAI alone updated them four times in 2025.
- TPM counts both input and output, so a 1M-context Gemini request consumes a full minute of 1M TPM by itself.
- Some endpoints (vision, audio, fine-tuned models) have separate quotas not shown in the main tier.
- Azure OpenAI and AWS Bedrock manage their own resource-based quotas that don't map directly to OpenAI/Anthropic tiers.
- Free Gemini AI Studio quotas are not for production use — terms explicitly restrict commercial deployment.
What may change in 12-24 months
- Rate limits will keep rising as inference capacity grows; the bottleneck is shifting from TPM to per-request latency for agentic workloads.
- Granular per-feature limits (vision TPM, reasoning TPM separate from text) are expected to spread.
- Reserved-capacity contracts (Vertex Provisioned Throughput, Azure PTU) will likely become more accessible to mid-market.
- BYOK fallback gateways will become standard infrastructure for any production agentic stack.
Related questions
ChatGPT, Perplexity and Gemini usually suggest these next.
- How do I implement multi-provider fallback for LLMs?
- When should I use OpenAI Batch API instead of streaming?
- How do Azure OpenAI quotas differ from raw OpenAI tiers?
- What is provisioned throughput on Vertex AI and is it worth it?
- How can I A/B test models without burning my OpenAI quota?
More LLM comparisons
Why BYOK helps with rate limits
Aggregator vs BYOK gateway
Enterprise capacity models compared
2026 pricing matrix
GPT-4o vs Claude Sonnet 4.5