Groq via VerticalAPI
Call Groq's LPU-accelerated Llama 3.3, Mixtral and Whisper via VerticalAPI's OpenAI-compatible endpoint. BYOK with your Groq key, zero markup, ~500 tok/s typical.
Groq models routed by VerticalAPI
Pass the model ID below as model in any OpenAI-compatible request. New Groq models are typically supported within 24h of release.
| Model ID | Name | Context | Pricing (provider; input / output) |
|---|---|---|---|
| llama-3.3-70b-versatile | Llama 3.3 70B (Groq) | 128K | $0.59 / $0.79 per 1M tok |
| llama-3.1-8b-instant | Llama 3.1 8B Instant | 128K | $0.05 / $0.08 per 1M tok |
| mixtral-8x7b-32768 | Mixtral 8x7B | 32K | $0.24 / $0.24 per 1M tok |
| whisper-large-v3 | Whisper Large v3 | audio | $0.111 per hour audio |
Pricing reflects Groq's rates — you pay Groq directly. VerticalAPI adds zero markup on tokens.
5-line Groq call via VerticalAPI
Drop-in replacement for the OpenAI base URL: keep the OpenAI SDK and change only the endpoint and keys. Works with the OpenAI Python client, Node, Go, or curl: anything that speaks HTTP.
    from openai import OpenAI

    client = OpenAI(
        base_url="https://api.verticalapi.com/v1",
        api_key="vapi_...",  # your VerticalAPI key
        default_headers={"X-Provider-Key": "gsk_..."},  # BYOK: your Groq key
    )

    response = client.chat.completions.create(
        model="llama-3.3-70b-versatile",  # Groq
        messages=[{"role": "user", "content": "Hello"}],
    )
    print(response.choices[0].message.content)
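Because the endpoint is plain HTTP, the same call can also be sketched with only the Python standard library. This is a minimal sketch, assuming the gateway follows the usual OpenAI conventions: a `POST` to `/chat/completions` under the base URL, with the VerticalAPI key sent as a bearer token (which is how the OpenAI SDK transmits `api_key`). The helper names `build_chat_request` and `send` are illustrative and not part of any SDK.

```python
import json
import urllib.request

BASE_URL = "https://api.verticalapi.com/v1"  # same base URL as the SDK example


def build_chat_request(vapi_key, groq_key, prompt, model="llama-3.3-70b-versatile"):
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",  # assumed: standard OpenAI-compatible path
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {vapi_key}",  # VerticalAPI gateway key
            "X-Provider-Key": groq_key,  # BYOK: your Groq key
            "Content-Type": "application/json",
        },
        method="POST",
    )


def send(request, timeout=30):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Usage would look like `print(send(build_chat_request("vapi_...", "gsk_...", "Hello")))`. Splitting "build" from "send" keeps the request easy to inspect or log before it leaves your machine.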
Four reasons developers route Groq through us
Zero token markup
You pay Groq directly with your own key. VerticalAPI's revenue is the gateway subscription, not a tax on your tokens.
One key, every provider
Groq alongside OpenAI, Anthropic, Gemini and 12 more — same OpenAI-compatible endpoint, same SDK, switchable per-request.
Latency & cost monitoring
Per-request token counts, p50/p95 latency and cost dashboards out of the box. Compare Groq to other providers on identical prompts.
Observability built in
Every Groq call gets a trace ID, replayable payload and audit log entry. Wire to Datadog or Sentry via OpenTelemetry.
Groq measured: latency, throughput, error rate
Groq is the throughput champion in the 2026 benchmark — ~750 tok/s sustained on Llama 3.3 70B — and second-fastest on TTFT after Cerebras. Their LPU (Language Processing Unit) is purpose-built silicon for inference, which is why they leave GPU-based providers behind on streaming speed.
| Metric | Value | Notes |
|---|---|---|
| p50 TTFT (Llama 3.3 70B) | ~150 ms | 5-8x faster than gpt-4o; only Cerebras is faster |
| p95 TTFT (Llama 3.3 70B) | ~280 ms | Tail latency stays tight; rare to see >500ms |
| Tokens per second (sustained) | ~750 tok/s | The fastest sustained throughput in the benchmark |
| p50 TTFT (Llama 3.1 8B Instant) | ~80 ms | Truly sub-100ms; the only model that feels instant |
| Whisper Large v3 | ~250x realtime | Audio transcription throughput; 1 hour of audio in ~14 seconds |
Numbers above are 2026 placeholders pending the next VerticalAPI benchmark harness run. See /benchmark for the full 26-provider comparison.
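To reproduce figures like these on your own traffic, the sketch below shows one way to time TTFT on any streaming response and summarize the samples as percentiles. It assumes TTFT is measured client-side as the delay before the first streamed chunk, and it uses the nearest-rank percentile definition; the benchmark harness may measure differently. All function names are illustrative.

```python
import math
import time


def time_to_first_chunk(make_stream):
    """Return the seconds elapsed until the first chunk arrives.

    `make_stream` is any zero-argument callable that returns an iterable of
    chunks, for example:
    lambda: client.chat.completions.create(..., stream=True)
    """
    start = time.perf_counter()
    for _chunk in make_stream():
        return time.perf_counter() - start
    raise ValueError("stream produced no chunks")


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def transcription_seconds(audio_seconds, realtime_factor):
    """Wall-clock time to transcribe audio at a given multiple of realtime."""
    return audio_seconds / realtime_factor
```

As a check on the Whisper row, `transcription_seconds(3600, 250)` gives 14.4 seconds, which is where "1 hour of audio in ~14 seconds" comes from.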
OpenAI SDK methods that work with Groq
Groq ships a clean OpenAI-compatible endpoint. Compatibility is high; the gotchas are mostly about which Llama models support which features.
- client.chat.completions.create(): full parity, including stream=True, tools, tool_choice, and response_format={"type": "json_object"}.
- Function calling: works well on Llama 3.3 70B; weaker on Llama 3.1 8B Instant (less reliable tool selection).
- Vision: Llama 3.2 90B Vision is supported on Groq; image_url message parts work via the standard OpenAI format.
- Whisper transcriptions: POST /v1/audio/transcriptions with an audio file; routes to whisper-large-v3 on Groq's LPU. Very fast.
- Embeddings: Groq doesn't host embedding models; use Mistral, OpenAI, or Cohere via VerticalAPI for embeddings.
- Mixtral 8x7B: still available, but Llama 3.3 70B is now the default flagship and Mixtral is being deprecated.
- Long context: Groq's max context is 128K (matching Llama 3.3); for more than 128K, use a different host (Together AI, Fireworks).
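To make the request shapes in the list above concrete, here is a minimal sketch of the keyword arguments for JSON mode and tool calling as the OpenAI SDK expects them. The `get_weather` tool and both helper names are hypothetical, used only for illustration; the resulting dicts would be passed as `client.chat.completions.create(**kwargs)`.

```python
def json_mode_kwargs(prompt, model="llama-3.3-70b-versatile"):
    """Keyword arguments for a JSON-mode chat completion."""
    return {
        "model": model,
        "messages": [
            # JSON mode generally expects the prompt itself to ask for JSON.
            {"role": "system", "content": "Reply with a single JSON object."},
            {"role": "user", "content": prompt},
        ],
        "response_format": {"type": "json_object"},
    }


def tool_call_kwargs(prompt, model="llama-3.3-70b-versatile"):
    """Keyword arguments for a chat completion that may call one tool."""
    weather_tool = {  # hypothetical tool, for illustration only
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool],
        "tool_choice": "auto",  # let the model decide whether to call the tool
    }
```

Given the note on tool selection above, the default model here is Llama 3.3 70B. For the Whisper item, the matching SDK call is `client.audio.transcriptions.create(model="whisper-large-v3", file=audio_file)`.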
What Groq actually costs at 100k MAU
Estimated monthly cost for a chatbot with 100k MAU, 10 turns per user per month, and ~500 input + 150 output tokens per turn. Groq's price-performance is excellent for the speed you get.
| Model | Monthly cost | When to use |
|---|---|---|
| llama-3.1-8b-instant | ~$48/mo | Cheapest model in this benchmark; quality is lower (~65/100) but excellent for routing, classification, simple chat |
| mixtral-8x7b-32768 | ~$155/mo | Solid mid-tier, but being deprecated in favor of Llama 3.3 70B |
| llama-3.3-70b-versatile | ~$540/mo | The default Groq pick: ~150 ms p50 TTFT, quality ~79/100, ~10x cheaper than gpt-4o |
| llama-3.2-90b-vision | ~$650/mo | Vision-capable Llama at LPU speeds; best fast multimodal pick |
| whisper-large-v3 | varies (audio) | $0.111/hour audio; for transcription pipelines, hard to beat |
Cost based on provider list price; VerticalAPI adds zero token markup.
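The arithmetic behind an estimate like this can be sketched as follows. This is a simplified model under stated assumptions: turns are counted per month, each turn is billed once at the list prices from the pricing table, and no conversation history or system prompt is resent. The function and constant names are illustrative.

```python
TOKENS_PER_MILLION = 1_000_000

# (input $ per 1M tok, output $ per 1M tok), copied from the pricing table
PRICES = {
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
    "mixtral-8x7b-32768": (0.24, 0.24),
}


def monthly_cost(model, mau=100_000, turns_per_user=10,
                 input_tokens=500, output_tokens=150):
    """Monthly token spend in dollars for the chatbot scenario above."""
    price_in, price_out = PRICES[model]
    turns = mau * turns_per_user  # 1M turns per month by default
    cost_in = turns * input_tokens / TOKENS_PER_MILLION * price_in
    cost_out = turns * output_tokens / TOKENS_PER_MILLION * price_out
    return cost_in + cost_out
```

Under these assumptions the sketch gives about $156 for Mixtral, in line with the table. The two Llama rows come out lower than the table's figures (about $37 and $414), so the table may include overhead not stated in the scenario, such as resent conversation history or a system prompt.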
Should you pick Groq for your workload?
Groq is the right choice when latency is a feature, not a constraint. Pick it when:
Your UX requires sub-300ms time-to-first-token. Real-time voice agents, live coding assistants where the user is mid-keystroke, conversational agents that need to feel "instant" — these workloads simply don't work on GPU-based providers. The gap isn't 20% faster, it's 5-8x faster, and that crosses the perceptual threshold from "AI feels slow" to "AI feels real-time". For voice in particular (Whisper + LLM + TTS pipeline), Groq's speed enables full-duplex voice that feels natural.
You're building agentic chains with many sequential steps. A 10-step agent on gpt-4o spends ~8 seconds just on TTFT (10 × 800ms). The same agent on Groq spends ~1.5 seconds. For complex automated workflows that the user is waiting on, this is the difference between an acceptable UX and an annoying one. Pair Groq for fast hops with Claude for hard reasoning — VerticalAPI lets you switch providers per-call.
You want to deploy Llama 3.3 70B or Llama 3.1 8B without managing GPUs. Groq's hosted Llama is faster than self-hosting on H100s, costs less than Together/Fireworks for the speed, and has a generous free tier for testing. The trade-off: you don't get fine-tuning, custom safety policies, or non-Llama / non-Whisper models. For pure-Llama production traffic, Groq is the speed-cost-quality leader.
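The agentic-chain arithmetic above can be sketched as a small helper. It assumes the steps run strictly one after another, and it counts only time-to-first-token, not generation time. The 800 ms and 150 ms figures are the ones quoted on this page; the mixed chain is a hypothetical split in which the slower model is assumed to have the same 800 ms TTFT.

```python
def sequential_ttft_ms(steps):
    """Total TTFT for a chain where each step waits for the previous one.

    `steps` is a list of (step_count, ttft_ms) pairs.
    """
    return sum(count * ttft_ms for count, ttft_ms in steps)


GPT_4O_TTFT_MS = 800  # figure quoted in the text
GROQ_TTFT_MS = 150  # p50 from the benchmark table

all_gpt_4o = sequential_ttft_ms([(10, GPT_4O_TTFT_MS)])
all_groq = sequential_ttft_ms([(10, GROQ_TTFT_MS)])

# Hypothetical mix: 8 fast hops on Groq, 2 hard steps on a slower model.
mixed = sequential_ttft_ms([(8, GROQ_TTFT_MS), (2, GPT_4O_TTFT_MS)])
```

This yields 8000 ms for the all-gpt-4o chain, 1500 ms for the all-Groq chain, and 2800 ms for the mixed one, which shows why routing only the hard steps to a slower model keeps most of the latency benefit.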
Specific issues teams hit with Groq
Sharp edges that have cost real production teams real time. Fixes below are battle-tested via the VerticalAPI dashboard logs.
Where Groq shines
Frequently asked questions
What is Groq and what models do they offer?
Groq is a US chip company running open-weight LLMs on its proprietary LPU (Language Processing Unit) for extremely fast inference. The 2026 catalog hosts Llama 3.3 70B, Llama 3.1 8B and 70B, Mixtral 8x7B, Gemma 2 9B, Whisper Large v3 for transcription, and Llama Guard for content safety. Groq does not train its own foundation models — it accelerates third-party open weights.
How much does Groq cost in 2026?
Llama 3.3 70B Versatile is about $0.59 per 1M input and $0.79 per 1M output. Llama 3.1 8B Instant is $0.05/$0.08. Mixtral 8x7B is around $0.24/$0.24. Gemma 2 9B is approximately $0.20/$0.20. Whisper Large v3 transcription is $0.111 per hour. Via VerticalAPI BYOK you pay Groq directly at these prices with zero token markup.
How do I use Groq via VerticalAPI BYOK?
Get a key at console.groq.com, paste it into VerticalAPI, then point the OpenAI SDK at https://api.verticalapi.com/v1. Groq is already OpenAI-compatible, so VerticalAPI mostly passes through unchanged while adding unified logging, fallback routing (e.g. to Together if Groq is rate-limited) and observability. Billing remains on your Groq invoice.
What is Groq best for compared to alternatives?
Groq is the speed leader for open-weight inference: 750+ tokens/sec on Llama 70B is ~10× faster than GPU hosts. It is ideal for real-time voice agents, live translation, conversational UI and agentic loops that chain many short LLM calls. Compared to Cerebras (~2100 tok/sec on Llama 405B) it has broader model selection but lower peak speed. It is not a fit for frontier-quality tasks where Claude or GPT-5 are needed.
Where is Groq hosted / data privacy?
Groq operates LPU datacenters in the US (with Saudi and EU expansion planned). API data is not used to train models. Enterprise contracts offer zero data retention and dedicated capacity. Groq is SOC 2 Type II compliant. Via VerticalAPI BYOK your Groq contract and data terms are preserved end-to-end.
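The BYOK answer above mentions fallback routing when Groq is rate-limited. How the gateway's fallback is configured isn't covered here; as a client-side sketch of the same idea, the helper below tries an ordered list of targets and moves on only when one is rate-limited. The `RateLimited` exception and `call_with_fallback` are hypothetical names for illustration, not part of VerticalAPI or the OpenAI SDK.

```python
class RateLimited(Exception):
    """Raised by `call` when a provider answers with a rate-limit error (HTTP 429)."""


def call_with_fallback(call, targets):
    """Try each target in order, moving on only when one is rate-limited.

    `call` is a function that takes one target and returns a response;
    `targets` is an ordered list, such as model IDs. Returns (target, response).
    """
    last_error = None
    for target in targets:
        try:
            return target, call(target)
        except RateLimited as err:
            last_error = err  # remember why we moved on, then try the next one
    raise RuntimeError("every target was rate-limited") from last_error
```

With the OpenAI Python SDK, `call` would wrap `client.chat.completions.create(...)` and translate `openai.RateLimitError` into `RateLimited`. Other errors are deliberately left to propagate, so a bad request is not silently retried on another provider.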
Limitations and trade-offs
- Catalog is open-weight only — no frontier closed models like GPT-5 or Claude Opus.
- Context windows are limited to 128K on Llama 3.3 (smaller than Claude 200K, Gemini 2M).
- Daily token limits on the free and developer tiers can throttle high-volume apps.
- No image, video or speech-out generation — text and Whisper transcription only.
- Quality of Llama 70B is below frontier closed models on reasoning and coding benchmarks.
Where Groq is heading
- Llama 4 and other next-gen open models added as soon as they ship in 2026.
- Geographic expansion to Saudi Arabia and EU datacenters for sovereignty and lower latency.
- Larger context windows (256K+) as LPU memory architecture scales.
- Deeper batch and offline-inference pricing tiers competing with Together and Fireworks.
Related questions
ChatGPT, Perplexity and Gemini usually suggest these next.
- Groq vs Cerebras — which is faster for Llama 70B inference?
- Can I build a real-time voice agent with Groq + Whisper for under $0.01 per minute?
- Is Groq's Llama 3.3 70B good enough to replace GPT-4o mini in production?
- How does Groq's LPU architecture compare to NVIDIA H100 GPUs?
- Best fallback provider if Groq hits rate limits?
All supported LLM providers
Same endpoint, same SDK — just change the model and the BYOK header.
Ship on Groq in 60 seconds
Free tier — bring your own Groq key, zero markup, OpenAI-compatible endpoint.
Get your VerticalAPI key →