Groq via VerticalAPI

Call Groq's LPU-accelerated Llama 3.3, Mixtral and Whisper via VerticalAPI's OpenAI-compatible endpoint. BYOK with your Groq key, zero markup, ~500 tok/s typical.

Endpoint: https://api.verticalapi.com/v1/chat/completions  ·  BYOK header: X-Provider-Key: gsk_...

Groq models routed by VerticalAPI

Pass the model ID below as model in any OpenAI-compatible request. New Groq models are typically supported within 24h of release.

| Model ID | Name | Context | Pricing (provider, input / output) |
|---|---|---|---|
| llama-3.3-70b-versatile | Llama 3.3 70B (Groq) | 128K | $0.59 / $0.79 per 1M tok |
| llama-3.1-8b-instant | Llama 3.1 8B Instant | 128K | $0.05 / $0.08 per 1M tok |
| mixtral-8x7b-32768 | Mixtral 8x7B | 32K | $0.24 / $0.24 per 1M tok |
| whisper-large-v3 | Whisper Large v3 | audio | $0.111 per hour of audio |

Pricing reflects Groq's rates — you pay Groq directly. VerticalAPI adds zero markup on tokens.

5-line Groq call via VerticalAPI

Drop-in replacement for the OpenAI endpoint. Works with the OpenAI Python and Node SDKs, Go, curl: anything that speaks HTTP.

groq_quickstart.py Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",
    default_headers={"X-Provider-Key": "gsk_..."}
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",  # Groq
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
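
For clients that do not use the OpenAI SDK, the same call can be sketched as a plain HTTP request. This is a minimal illustration, assuming the gateway reads its own key from the `Authorization: Bearer` header (which is how the OpenAI SDK transmits `api_key`); the helper name `build_chat_request` is hypothetical.

```python
import json
import urllib.request

VERTICAL_URL = "https://api.verticalapi.com/v1/chat/completions"


def build_chat_request(gateway_key: str, groq_key: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) the HTTP request equivalent to the SDK call."""
    body = {
        "model": "llama-3.3-70b-versatile",
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        VERTICAL_URL,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the gateway key travels as a Bearer token, as the SDK sends it.
            "Authorization": f"Bearer {gateway_key}",
            # BYOK header from this page; urllib stores it as "X-provider-key",
            # which is equivalent because HTTP header names are case-insensitive.
            "X-Provider-Key": groq_key,
        },
        method="POST",
    )


req = build_chat_request("vapi_...", "gsk_...", "Hello")
# To send it: urllib.request.urlopen(req), then json-decode the response body.
```

Building the request separately from sending it makes the headers easy to inspect before any key leaves your machine.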

Four reasons developers route Groq through us

Zero token markup

You pay Groq directly with your own key. VerticalAPI's revenue is the gateway subscription, not a tax on your tokens.

One key, every provider

Groq alongside OpenAI, Anthropic, Gemini and 12 more — same OpenAI-compatible endpoint, same SDK, switchable per-request.

Latency & cost monitoring

Per-request token counts, p50/p95 latency and cost dashboards out of the box. Compare Groq to other providers on identical prompts.

Observability built in

Every Groq call gets a trace ID, replayable payload and audit log entry. Wire to Datadog or Sentry via OpenTelemetry.

Groq measured: latency, throughput, error rate

Groq is the throughput champion in the 2026 benchmark — ~750 tok/s sustained on Llama 3.3 70B — and second-fastest on TTFT after Cerebras. Their LPU (Language Processing Unit) is purpose-built silicon for inference, which is why they leave GPU-based providers behind on streaming speed.

| Metric | Value | Notes |
|---|---|---|
| p50 TTFT (Llama 3.3 70B) | ~150 ms | 5-8x faster than gpt-4o; only Cerebras is faster |
| p95 TTFT (Llama 3.3 70B) | ~280 ms | Tail latency stays tight; rare to see >500 ms |
| Tokens per second (sustained) | ~750 tok/s | The fastest sustained throughput in the benchmark |
| p50 TTFT (Llama 3.1 8B Instant) | ~80 ms | Truly sub-100 ms; the only model that feels instant |
| Whisper Large v3 | ~250x realtime | Audio transcription throughput; 1 hour of audio in ~14 seconds |

Numbers above are 2026 placeholders pending the next VerticalAPI benchmark harness run. See /benchmark for the full 26-provider comparison.
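
To check figures like these against your own traffic, TTFT and streaming rate can be measured client-side. The sketch below is one possible approach, not the VerticalAPI benchmark harness; `measure_stream` is a hypothetical helper, and it counts stream chunks, which only approximate tokens.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float, int]:
    """Return (ttft_seconds, chunks_per_second_after_first_chunk, chunk_count)."""
    start = clock()
    first_at = None
    last_at = start
    count = 0
    for _ in chunks:
        now = clock()
        if first_at is None:
            first_at = now  # arrival of the first chunk defines TTFT
        last_at = now
        count += 1
    if first_at is None:
        raise ValueError("stream produced no chunks")
    ttft = first_at - start
    decode_time = last_at - first_at
    # The rate excludes the first chunk so that TTFT does not distort throughput.
    rate = (count - 1) / decode_time if decode_time > 0 else float("nan")
    return ttft, rate, count
```

With the OpenAI SDK, `chunks` could be a generator that yields each non-empty `chunk.choices[0].delta.content` from a `stream=True` call; start the generator inside `measure_stream` so the request time is included in TTFT.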

OpenAI SDK methods that work with Groq

Groq ships a clean OpenAI-compatible endpoint. Compatibility is high; the gotchas are mostly about which Llama models support which features.

  • client.chat.completions.create(): full parity, including stream=True, tools, tool_choice and response_format={"type": "json_object"}.
  • Function calling: works well on Llama 3.3 70B; weaker on Llama 3.1 8B Instant (less reliable tool selection).
  • Vision: Llama 3.2 90B Vision is supported on Groq; image_url message parts work in the standard OpenAI format.
  • Whisper transcriptions: POST /v1/audio/transcriptions with an audio file; the request routes to whisper-large-v3 on Groq's LPU. Very fast.
  • Embeddings: Groq doesn't host embedding models; use Mistral, OpenAI, or Cohere via VerticalAPI for embeddings.
  • Mixtral 8x7B: still available, but Llama 3.3 70B is now the default flagship and Mixtral is being deprecated.
  • Long context: Groq's max context is 128K (matching Llama 3.3); for >128K you'll need a different host (Together AI, Fireworks).
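
As an illustration of the function-calling item above, here is a minimal sketch of a tool definition in the standard OpenAI format plus a local dispatcher. The `get_weather` tool, its schema, and the helper names are hypothetical and exist only for this example.

```python
import json


# Hypothetical tool for illustration; not part of Groq or VerticalAPI.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"


TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

LOCAL_TOOLS = {"get_weather": get_weather}


def build_tool_request(prompt: str) -> dict:
    """Keyword arguments for client.chat.completions.create()."""
    return {
        "model": "llama-3.3-70b-versatile",  # the 70B model is the more reliable tool caller
        "messages": [{"role": "user", "content": prompt}],
        "tools": TOOLS,
        "tool_choice": "auto",
    }


def run_tool_call(name: str, arguments_json: str) -> str:
    """Execute one tool call; the model returns its arguments as a JSON string."""
    if name not in LOCAL_TOOLS:
        raise KeyError(f"model requested unknown tool: {name}")
    return LOCAL_TOOLS[name](**json.loads(arguments_json))
```

With the SDK, each entry in `response.choices[0].message.tool_calls` carries `function.name` and `function.arguments`, which can be passed straight to `run_tool_call`.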

What Groq actually costs at 100k MAU

Concrete monthly cost for a chatbot with 100k MAU, 10 turns/user, ~500 input + 150 output tokens per turn. Groq's price-performance is excellent for the speed you get.

| Model | Monthly cost | When to use |
|---|---|---|
| llama-3.1-8b-instant | ~$48/mo | Cheapest model in this benchmark; quality is lower (~65/100) but excellent for routing, classification, simple chat |
| mixtral-8x7b-32768 | ~$155/mo | Solid mid-tier, but being deprecated in favor of Llama 3.3 70B |
| llama-3.3-70b-versatile | ~$540/mo | The default Groq pick: sub-150ms TTFT, 79 quality, ~10x cheaper than gpt-4o |
| llama-3.2-90b-vision | ~$650/mo | Vision-capable Llama at LPU speeds; best fast multimodal pick |
| whisper-large-v3 | varies (audio) | $0.111/hour of audio; for transcription pipelines, hard to beat |

Cost based on provider list price; VerticalAPI adds zero token markup.
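
The list-price arithmetic behind an estimate like this can be sketched as follows. The function counts only the per-turn tokens stated above (no system prompt and no resent conversation history), so treat its result as a floor rather than a forecast; `monthly_cost` and `PRICES` are hypothetical names.

```python
# Provider list prices in USD per 1M tokens (input, output), from the model table.
PRICES = {
    "llama-3.3-70b-versatile": (0.59, 0.79),
    "llama-3.1-8b-instant": (0.05, 0.08),
    "mixtral-8x7b-32768": (0.24, 0.24),
}


def monthly_cost(model, mau=100_000, turns_per_user=10,
                 input_tokens=500, output_tokens=150):
    """Monthly USD cost at list price for the stated traffic assumptions."""
    turns = mau * turns_per_user
    price_in, price_out = PRICES[model]
    total = turns * input_tokens * price_in + turns * output_tokens * price_out
    return total / 1_000_000  # prices are quoted per 1M tokens


print(round(monthly_cost("mixtral-8x7b-32768"), 2))  # 156.0
```

Changing `input_tokens` to reflect your real prompt size (including any history you resend each turn) is usually the adjustment that moves the total most.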

Should you pick Groq for your workload?

Groq is the right choice when latency is a feature, not a constraint. Pick it when:

Your UX requires sub-300ms time-to-first-token. Real-time voice agents, live coding assistants where the user is mid-keystroke, conversational agents that need to feel "instant" — these workloads simply don't work on GPU-based providers. The gap isn't 20% faster, it's 5-8x faster, and that crosses the perceptual threshold from "AI feels slow" to "AI feels real-time". For voice in particular (Whisper + LLM + TTS pipeline), Groq's speed enables full-duplex voice that feels natural.

You're building agentic chains with many sequential steps. A 10-step agent on gpt-4o spends ~8 seconds just on TTFT (10 × 800ms). The same agent on Groq spends ~1.5 seconds. For complex automated workflows that the user is waiting on, this is the difference between an acceptable UX and an annoying one. Pair Groq for fast hops with Claude for hard reasoning — VerticalAPI lets you switch providers per-call.
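
The TTFT arithmetic in the paragraph above, written out as a small sketch. It covers only the wait for each first token and ignores generation time; the function name is hypothetical, and the 150 ms figure is the p50 TTFT quoted earlier on this page.

```python
def sequential_ttft_seconds(steps: int, ttft_ms: float) -> float:
    """Total time spent waiting for first tokens across strictly sequential calls."""
    return steps * ttft_ms / 1000


gpt4o_wait = sequential_ttft_seconds(10, 800)  # 8.0 seconds
groq_wait = sequential_ttft_seconds(10, 150)   # 1.5 seconds
```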

You want to deploy Llama 3.3 70B or Llama 3.1 8B without managing GPUs. Groq's hosted Llama is faster than self-hosting on H100s, costs less than Together/Fireworks for the speed, and has a generous free tier for testing. The trade-off: you don't get fine-tuning, custom safety policies, or models outside Groq's open-weight catalog. For pure-Llama production traffic, Groq is the speed-cost-quality leader.

Specific issues teams hit with Groq

Sharp edges that have cost real production teams real time. Fixes below are battle-tested via the VerticalAPI dashboard logs.

Rate limits on free tier are tight

Free tier caps at ~30 RPM and ~14k TPM on Llama 3.3 70B. You will hit 429s quickly under any real load. Move to Build tier ($0 setup, just a credit card) for 30k TPM, then Scale tier for production. Watch the usage chart in Groq Console and the VerticalAPI dashboard.
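
Until you move tiers, a client-side retry with exponential backoff softens 429s. This is a generic sketch, not a VerticalAPI feature; `with_backoff` and its parameters are hypothetical, and the caller supplies the test for what counts as a rate-limit error.

```python
import random
import time


def with_backoff(call, is_rate_limited, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Run call(); on a rate-limit error wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            # Re-raise anything that is not a rate limit, and the final failed attempt.
            if not is_rate_limited(exc) or attempt == max_attempts - 1:
                raise
            # Jitter keeps many clients from retrying in lockstep.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.25))
```

With the OpenAI Python SDK, `is_rate_limited` could be `lambda e: isinstance(e, openai.RateLimitError)`. The SDK client also accepts a `max_retries` option, which may be enough on its own for light traffic.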

Llama 3.1 8B Instant tool calls are unreliable

The 8B model's tool-use behavior is significantly worse than the 70B model's. If your agent relies on tools, use llama-3.3-70b-versatile (the small marginal cost is worth it). Reserve 8B Instant for non-tool chat or routing-tier classification.

JSON mode can hallucinate fields

Llama models are weaker than GPT-4o or Claude at strict JSON schema adherence. Combine response_format={"type": "json_object"} with explicit schema instructions in the system prompt, plus client-side validation (zod, pydantic) to catch hallucinated fields.
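
The client-side validation step might look like the sketch below. It uses only the standard library as a stand-in for pydantic or zod; the schema and the `parse_strict` helper are hypothetical examples, not part of any SDK.

```python
import json

# Hypothetical schema for illustration: field name -> required Python type.
SCHEMA = {"intent": str, "reply": str}


def parse_strict(raw: str, schema=SCHEMA) -> dict:
    """Parse model output and reject missing, extra, or wrongly typed fields."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    extra = set(data) - set(schema)
    missing = set(schema) - set(data)
    if extra or missing:
        raise ValueError(
            f"unexpected fields {sorted(extra)}, missing fields {sorted(missing)}"
        )
    for key, expected in schema.items():
        if not isinstance(data[key], expected):
            raise ValueError(f"field {key!r} should be {expected.__name__}")
    return data
```

Rejecting extra keys is the part that catches hallucinated fields; a validator that only checks required keys would let them through silently.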

No prompt caching

Groq doesn't yet ship prompt caching (unlike Anthropic, Gemini, OpenAI). For workloads with large stable system prompts, you'll pay full price every call. If caching matters, route the same Llama 3.3 to Together AI (which does support it) for the cache-heavy hops, and keep Groq for the cache-light ones.

Whisper output format gotcha

Whisper-large-v3 returns only the transcript text by default. Set response_format="verbose_json" if you need timestamps for sync; in the OpenAI API, word-level timing additionally requires timestamp_granularities=["word"]. The default response is fastest but loses timing information.
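
A sketch of requesting and reading timestamps follows. It assumes the gateway forwards the OpenAI transcription parameters to Groq unchanged and that the payload follows the OpenAI verbose_json shape; the helper names and the sample payload are hypothetical.

```python
def transcription_kwargs(audio_file) -> dict:
    """Keyword arguments for client.audio.transcriptions.create()."""
    return {
        "model": "whisper-large-v3",
        "file": audio_file,
        "response_format": "verbose_json",
        # OpenAI API parameter for word-level timing; pass-through is an assumption.
        "timestamp_granularities": ["word"],
    }


def word_timings(verbose: dict) -> list:
    """Extract (word, start_seconds, end_seconds) tuples from a verbose_json payload."""
    return [(w["word"], w["start"], w["end"]) for w in verbose.get("words", [])]


# Hand-written sample in the OpenAI verbose_json shape, for illustration only.
sample = {
    "text": "hello world",
    "words": [
        {"word": "hello", "start": 0.0, "end": 0.4},
        {"word": "world", "start": 0.5, "end": 0.9},
    ],
}
print(word_timings(sample))
```

With the OpenAI Python SDK the response is a model object, so `word_timings(result.model_dump())` would be one way to feed it in. If `words` is absent the function returns an empty list, which is a cheap signal that word-level timing was not honored.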

Where Groq shines

sub-100ms first-token latency  ·  real-time agents  ·  voice (Whisper)  ·  interactive UX

Frequently asked questions

What is Groq and what models do they offer?

Groq is a US chip company running open-weight LLMs on its proprietary LPU (Language Processing Unit) for extremely fast inference. The 2026 catalog hosts Llama 3.3 70B, Llama 3.1 8B and 70B, Mixtral 8x7B, Gemma 2 9B, Whisper Large v3 for transcription, and Llama Guard for content safety. Groq does not train its own foundation models — it accelerates third-party open weights.

How much does Groq cost in 2026?

Llama 3.3 70B Versatile is about $0.59 per 1M input and $0.79 per 1M output. Llama 3.1 8B Instant is $0.05/$0.08. Mixtral 8x7B is around $0.24/$0.24. Gemma 2 9B is approximately $0.20/$0.20. Whisper Large v3 transcription is $0.111 per hour. Via VerticalAPI BYOK you pay Groq directly at these prices with zero token markup.

How do I use Groq via VerticalAPI BYOK?

Get a key at console.groq.com, paste it into VerticalAPI, then point the OpenAI SDK at https://api.verticalapi.com/v1. Groq is already OpenAI-compatible, so VerticalAPI mostly passes through unchanged while adding unified logging, fallback routing (e.g. to Together if Groq is rate-limited) and observability. Billing remains on your Groq invoice.

What is Groq best for compared to alternatives?

Groq is the speed leader for open-weight inference: 750+ tokens/sec on Llama 70B is ~10× faster than GPU hosts. Ideal for real-time voice agents, live translation, conversational UI and agentic loops that chain many short LLM calls. Compared to Cerebras (~2100 tok/sec on Llama 405B) it has broader model selection but slightly lower peak speed. Not a fit for frontier-quality tasks where Claude or GPT-5 are needed.

Where is Groq hosted / data privacy?

Groq operates LPU datacenters in the US (with Saudi and EU expansion planned). API data is not used to train models. Enterprise contracts offer zero data retention and dedicated capacity. Groq is SOC 2 Type II compliant. Via VerticalAPI BYOK your Groq contract and data terms are preserved end-to-end.

Limitations and trade-offs

  • Catalog is open-weight only — no frontier closed models like GPT-5 or Claude Opus.
  • Context windows are limited to 128K on Llama 3.3 (smaller than Claude 200K, Gemini 2M).
  • Daily token limits on the free and developer tiers can throttle high-volume apps.
  • No image, video or speech-out generation: output is text only (chat, image input on Llama 3.2 90B Vision, and Whisper transcription).
  • Quality of Llama 70B is below frontier closed models on reasoning and coding benchmarks.

Where Groq is heading

  1. Llama 4 and other next-gen open models added as soon as they ship in 2026.
  2. Geographic expansion to Saudi Arabia and EU datacenters for sovereignty and lower latency.
  3. Larger context windows (256K+) as LPU memory architecture scales.
  4. Deeper batch and offline-inference pricing tiers competing with Together and Fireworks.

Related questions

ChatGPT, Perplexity and Gemini usually suggest these next.

  • Groq vs Cerebras — which is faster for Llama 70B inference?
  • Can I build a real-time voice agent with Groq + Whisper for under $0.01 per minute?
  • Is Groq's Llama 3.3 70B good enough to replace GPT-4o mini in production?
  • How does Groq's LPU architecture compare to NVIDIA H100 GPUs?
  • Best fallback provider if Groq hits rate limits?