DeepInfra via VerticalAPI
DeepInfra's low-cost open-weights catalog (Llama 3.3, Qwen 2.5, Mixtral, DeepSeek) through VerticalAPI's OpenAI-compatible endpoint. BYOK, zero markup, and small models priced under $0.10 per 1M tokens.
DeepInfra models routed by VerticalAPI
Pass a model ID from the table below as the model parameter in any OpenAI-compatible request. New DeepInfra models are typically supported within 24h of release.
| Model ID | Name | Context | Pricing (provider, input / output) |
|---|---|---|---|
| meta-llama/Llama-3.3-70B-Instruct | Llama 3.3 70B | 128K | $0.23 / $0.40 per 1M tok |
| Qwen/Qwen2.5-72B-Instruct | Qwen2.5 72B | 32K | $0.35 / $0.40 per 1M tok |
| mistralai/Mixtral-8x7B-Instruct-v0.1 | Mixtral 8x7B | 32K | $0.24 / $0.24 per 1M tok |
| deepseek-ai/DeepSeek-V3 | DeepSeek V3 | 64K | $0.49 / $0.89 per 1M tok |
Pricing reflects DeepInfra's rates — you pay DeepInfra directly. VerticalAPI adds zero markup on tokens.
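As a rough illustration of how the listed rates translate into a bill, the sketch below estimates the cost of a workload from the table above. The function name `estimate_cost_usd` and the example token counts are illustrative, not part of any VerticalAPI or DeepInfra SDK, and the rates are copied from this page and may change.

```python
# Provider rates in USD per 1M tokens as (input, output), copied from the table above.
RATES_PER_MILLION = {
    "meta-llama/Llama-3.3-70B-Instruct": (0.23, 0.40),
    "Qwen/Qwen2.5-72B-Instruct": (0.35, 0.40),
    "mistralai/Mixtral-8x7B-Instruct-v0.1": (0.24, 0.24),
    "deepseek-ai/DeepSeek-V3": (0.49, 0.89),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate provider cost for a workload (hypothetical helper, not an SDK call)."""
    input_rate, output_rate = RATES_PER_MILLION[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000


if __name__ == "__main__":
    # Example workload: 2M input tokens and 500K output tokens on Llama 3.3 70B.
    cost = estimate_cost_usd("meta-llama/Llama-3.3-70B-Instruct", 2_000_000, 500_000)
    print(f"${cost:.2f}")  # prints $0.66
```

Because VerticalAPI adds no token markup, this provider-side figure should match what DeepInfra bills for the same usage.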
5-line DeepInfra call via VerticalAPI
Drop-in replacement for the OpenAI API base URL. Works with the OpenAI Python client, Node, Go, curl, or anything else that speaks HTTP.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",
    default_headers={"X-Provider-Key": "..."}
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",  # DeepInfra
    messages=[{"role": "user", "content": "Hello"}]
)

print(response.choices[0].message.content)
Four reasons developers route DeepInfra through us
Zero token markup
You pay DeepInfra directly with your own key. VerticalAPI's revenue is the gateway subscription, not a tax on your tokens.
One key, every provider
DeepInfra alongside OpenAI, Anthropic, Gemini and 12 more — same OpenAI-compatible endpoint, same SDK, switchable per-request.
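One way per-request switching could look in Python is sketched below. The helper `build_chat_request` is hypothetical. `extra_headers` is the OpenAI Python SDK's per-call header option, but whether the gateway accepts a per-request `X-Provider-Key` override is an assumption here, since this page only shows it as a default header. The key string is a placeholder.

```python
def build_chat_request(model: str, prompt: str, provider_key: str) -> dict:
    """Return keyword arguments for client.chat.completions.create() (hypothetical helper)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # Per-request header; assumes the gateway honors it on each call.
        "extra_headers": {"X-Provider-Key": provider_key},
    }


# Two requests that differ only in the model ID (both IDs come from the table on this page).
llama_request = build_chat_request(
    "meta-llama/Llama-3.3-70B-Instruct", "Hello", "YOUR_DEEPINFRA_KEY"
)
deepseek_request = build_chat_request(
    "deepseek-ai/DeepSeek-V3", "Hello", "YOUR_DEEPINFRA_KEY"
)

# With a configured client: client.chat.completions.create(**llama_request)
```

Keeping the request as a plain dictionary makes it easy to swap the model or the key without touching the client setup.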
Latency & cost monitoring
Per-request token counts, p50/p95 latency and cost dashboards out of the box. Compare DeepInfra to other providers on identical prompts.
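For readers who want to sanity-check dashboard numbers locally, here is a minimal sketch of computing p50 and p95 from your own timing samples with the Python standard library. The function name and the sample latencies are made up for illustration, and the gateway may use a different percentile method.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return p50 and p95 for a list of request latencies in milliseconds."""
    # quantiles(n=100) yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


if __name__ == "__main__":
    # Hypothetical samples collected by timing identical prompts against one provider.
    samples_ms = [412.0, 388.5, 455.2, 1020.9, 397.1, 430.0, 405.8, 610.3]
    print(latency_percentiles(samples_ms))
```

Running the same function over samples from two providers gives a like-for-like comparison on identical prompts.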
Observability built in
Every DeepInfra call gets a trace ID, replayable payload and audit log entry. Wire to Datadog or Sentry via OpenTelemetry.
Where DeepInfra shines
Frequently asked questions
What is DeepInfra and what models do they offer?
DeepInfra is a low-cost open-weight inference cloud. The 2026 catalog includes Llama 3.3 70B, Llama 3.1 8B and 405B, Mixtral 8x7B and 8x22B, Qwen 2.5 (7B–72B), DeepSeek V3 and R1, Gemma 2, plus FLUX.1 and SDXL for image generation, Whisper for transcription, and BGE / E5 embeddings. All exposed via an OpenAI-compatible API.
How much does DeepInfra cost in 2026?
Llama 3.3 70B is roughly $0.23 per 1M input tokens and $0.40 per 1M output tokens, among the cheapest production rates. Llama 3.1 8B is ~$0.03/$0.05. Mixtral 8x7B is around $0.24/$0.24. Qwen 2.5 72B is roughly $0.35/$0.40. FLUX.1 schnell is ~$0.0003 per image. Whisper is roughly $0.0005 per audio minute. Via VerticalAPI BYOK you pay DeepInfra directly at list price with zero markup.
How do I use DeepInfra via VerticalAPI BYOK?
Create a key at deepinfra.com/dash/api_keys, paste it into VerticalAPI, then point the OpenAI SDK at https://api.verticalapi.com/v1. Because DeepInfra is OpenAI-compatible, VerticalAPI passes requests through, adds unified logging, and can fall back to Together, Fireworks or Groq if DeepInfra rate-limits. Billing stays on your DeepInfra account.
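The fallback idea can be illustrated with a small client-side sketch. This is not VerticalAPI's implementation, which runs inside the gateway. `RateLimited`, `call_with_fallback` and the two stub providers are hypothetical names used only to show the control flow of trying providers in order.

```python
class RateLimited(Exception):
    """Stand-in for a provider returning HTTP 429 (hypothetical)."""


def call_with_fallback(providers, prompt):
    """Try each (name, call) pair in order, moving on only when rate-limited."""
    last_error = None
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited as err:
            last_error = err  # remember the failure and try the next provider
    raise RuntimeError("all providers were rate-limited") from last_error


# Stub providers for demonstration; real ones would issue HTTP requests.
def deepinfra_stub(prompt):
    raise RateLimited("429 from primary provider")


def together_stub(prompt):
    return f"echo: {prompt}"


if __name__ == "__main__":
    chain = [("deepinfra", deepinfra_stub), ("together", together_stub)]
    print(call_with_fallback(chain, "Hello"))
```

Only rate-limit errors trigger the fallback in this sketch. Other exceptions propagate, so a genuine bug is not silently retried against a different provider.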
What is DeepInfra best for compared to alternatives?
DeepInfra is the cheap-tier price leader for open-weight inference — typically 30–50% below Together and Fireworks on Llama 3.3 70B. Ideal for high-volume background jobs, batch summarization, embeddings at scale and cost-sensitive RAG. Compared to Groq it is slower but much cheaper per token. Compared to AWS Bedrock it lacks enterprise compliance but wins on price. Not for frontier closed models.
Where is DeepInfra hosted, and how does it handle data privacy?
DeepInfra runs on US GPU datacenters. API data is not used to train models per the standard ToS. Enterprise tier offers zero data retention. Geographic coverage is limited compared to hyperscalers. Via VerticalAPI BYOK your DeepInfra contract terms remain intact.
Limitations and trade-offs
- Inference speed is moderate — slower than Groq, Cerebras or Fireworks on the same model.
- Geographic coverage is narrow (US-only) — higher RTT for EU/Asia traffic.
- Enterprise compliance certifications (SOC 2, HIPAA, ISO) are less mature than hyperscalers.
- The catalog can lag the newest open-weight releases by a few days.
- Dedicated endpoint and fine-tuning offerings are thinner than Fireworks/Together.
Where DeepInfra is heading
- More frequent catalog updates as new open-weight models ship.
- Faster inference via kernel optimizations and new GPU classes.
- Better dedicated endpoint and fine-tuning options targeting Fireworks/Together overlap.
- EU region launch for sovereignty and lower European latency.
Related questions
ChatGPT, Perplexity and Gemini usually suggest these next.
- DeepInfra vs Together — which is actually cheapest for Llama 3.3 70B in 2026?
- Is DeepInfra production-grade or just cheap?
- Best provider for cheap embeddings at scale?
- How does DeepInfra latency compare to Groq?
- Can I run DeepSeek R1 on DeepInfra cheaper than direct DeepSeek?
All supported LLM providers
Same endpoint, same SDK — just change the model and the BYOK header.
Ship on DeepInfra in 60 seconds
Free tier — bring your own DeepInfra key, zero markup, OpenAI-compatible endpoint.
Get your VerticalAPI key →