Cerebras via VerticalAPI

Cerebras CS-3 wafer-scale inference (Llama 3.3, Llama 4) through VerticalAPI's OpenAI-compatible endpoint. Bring your own Cerebras key (BYOK), pay zero markup on tokens, and expect roughly 2,000 tok/s in typical use.

Endpoint: https://api.verticalapi.com/v1/chat/completions  ·  BYOK header: X-Provider-Key: csk-...

Cerebras models routed by VerticalAPI

Pass the model ID below as model in any OpenAI-compatible request. New Cerebras models are typically supported within 24h of release.

Model ID        Name                        Context   Pricing (provider, input / output)
llama3.3-70b    Llama 3.3 70B (Cerebras)    128K      $0.85 / $1.20 per 1M tok
llama-4-scout   Llama 4 Scout (Cerebras)    10M       Preview pricing (host-dependent)
llama3.1-8b     Llama 3.1 8B (Cerebras)     128K      $0.10 / $0.10 per 1M tok

Pricing reflects Cerebras's rates — you pay Cerebras directly. VerticalAPI adds zero markup on tokens.

Quickstart: a Cerebras call via VerticalAPI

Drop-in compatible with the OpenAI SDK: change the base URL and add the provider-key header. Works with the OpenAI Python and Node clients, Go, curl, or anything else that speaks HTTP.

cerebras_quickstart.py Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",
    default_headers={"X-Provider-Key": "csk-..."}
)

response = client.chat.completions.create(
    model="llama3.3-70b",  # Cerebras
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
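
Because the endpoint is plain HTTP, the same call can be sketched without the SDK. The example below builds the request with Python's standard library only. It assumes the VerticalAPI key travels as a standard Authorization: Bearer header (which is what the OpenAI SDK does with api_key); the helper names build_request and send_request are illustrative, not part of any VerticalAPI client.

```python
import json
import urllib.request

ENDPOINT = "https://api.verticalapi.com/v1/chat/completions"


def build_request(vapi_key: str, cerebras_key: str, prompt: str,
                  model: str = "llama3.3-70b") -> urllib.request.Request:
    """Build (but do not send) a chat-completions request."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: gateway key is sent as a Bearer token, as the SDK does.
            "Authorization": f"Bearer {vapi_key}",
            # BYOK header from the endpoint description above.
            "X-Provider-Key": cerebras_key,
        },
        method="POST",
    )


def send_request(req: urllib.request.Request) -> str:
    """Send the request and return the first choice's text."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

Calling send_request(build_request("vapi_...", "csk-...", "Hello")) should behave like the SDK example, assuming both keys are valid.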

Four reasons developers route Cerebras through us

Zero token markup

You pay Cerebras directly with your own key. VerticalAPI's revenue is the gateway subscription, not a tax on your tokens.

One key, every provider

Cerebras alongside OpenAI, Anthropic, Gemini and 12 more — same OpenAI-compatible endpoint, same SDK, switchable per-request.

Latency & cost monitoring

Per-request token counts, p50/p95 latency and cost dashboards out of the box. Compare Cerebras to other providers on identical prompts.
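
For readers who want to sanity-check dashboard numbers against their own measurements, here is a minimal client-side sketch of the p50/p95 calculation. It is not how VerticalAPI computes its dashboards (that is not described here); latency_summary is a hypothetical helper, and the latencies are assumed to be wall-clock milliseconds you recorded yourself, for example with time.perf_counter() around each call.

```python
import statistics


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50 and p95 for a list of request latencies in milliseconds."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    # quantiles(n=100) returns 99 cut points: index 49 is p50, index 94 is p95.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}
```

Running the same prompt set against two model IDs and comparing the two summaries gives a rough like-for-like comparison.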

Observability built in

Every Cerebras call gets a trace ID, replayable payload and audit log entry. Wire to Datadog or Sentry via OpenTelemetry.

Where Cerebras shines

  • Fastest first-token (~70ms)
  • Real-time voice agents
  • Interactive UX
  • Code completion

Frequently asked questions

What is Cerebras and what models do they offer?

Cerebras Systems builds the WSE-3, a wafer-scale AI accelerator with 900,000 cores on a single chip. The Cerebras Inference Cloud hosts open-weight LLMs at industry-leading speed: Llama 3.3 70B, Llama 3.1 8B and 405B, plus DeepSeek R1 distilled and Qwen variants. Cerebras does not train its own foundation models — it accelerates third-party open weights.

How much does Cerebras cost in 2026?

Llama 3.3 70B is roughly $0.85 per 1M input and $1.20 per 1M output. Llama 3.1 8B is around $0.10/$0.10. Llama 3.1 405B sits at $6/$12. DeepSeek R1 distillations are competitively priced. Pricing is on par with Groq and Together while delivering 2–3× higher throughput. Via VerticalAPI BYOK you pay Cerebras directly with zero token markup.
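
To turn those per-million rates into a bill estimate, the arithmetic can be sketched as below. The rates are copied from the approximate figures quoted above and may change; estimate_cost_usd is a hypothetical helper, and only models whose IDs appear in the model table are included. As a worked example, 2M input tokens plus 500K output tokens on llama3.3-70b comes to about 2 × $0.85 + 0.5 × $1.20 = $2.30.

```python
# USD per 1M tokens as (input, output); approximate figures quoted above.
RATES_PER_MILLION = {
    "llama3.3-70b": (0.85, 1.20),
    "llama3.1-8b": (0.10, 0.10),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate provider cost in USD for one model and a given token volume."""
    input_rate, output_rate = RATES_PER_MILLION[model]
    return (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
```

With zero gateway markup, this estimate should track the Cerebras invoice directly; the token counts can be taken from response.usage in the OpenAI-compatible response.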

How do I use Cerebras via VerticalAPI BYOK?

Get a key at cloud.cerebras.ai, paste it into VerticalAPI, then point the OpenAI SDK at https://api.verticalapi.com/v1. Cerebras is OpenAI-compatible, so VerticalAPI passes through while adding unified logging, observability and automatic fallback routing (e.g. to Groq or Together if Cerebras is saturated). Billing remains on your Cerebras invoice.
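
How the gateway configures its fallback routing is not described here, so the following is only a client-side analogue of the idea: try a list of model IDs in order and return the first reply that succeeds. complete_with_fallback and the call_model callable are hypothetical names; in practice call_model would wrap client.chat.completions.create from the quickstart.

```python
from typing import Callable, Sequence


def complete_with_fallback(call_model: Callable[[str], str],
                           model_ids: Sequence[str]) -> tuple[str, str]:
    """Try each model ID in order; return (model_id, reply) for the first success."""
    errors = []
    for model_id in model_ids:
        try:
            return model_id, call_model(model_id)
        except Exception as exc:  # e.g. timeouts or rate-limit errors
            errors.append(f"{model_id}: {exc}")
    raise RuntimeError("all models failed: " + "; ".join(errors))
```

A production version would likely catch only retryable errors (timeouts, HTTP 429/5xx) rather than every exception; catching broadly keeps the sketch short.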

What is Cerebras best for compared to alternatives?

Cerebras holds the speed record for open-weight inference: ~2100 tok/sec on Llama 70B and ~969 tok/sec on Llama 405B, fast enough to make agentic loops and real-time voice feel instantaneous. Compared to Groq, Cerebras is faster on Llama 70B and uniquely hosts Llama 405B at speed. It is not a fit for frontier closed models or multimodal workloads: it serves text only, on selected open-weight models.
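
To see what those throughput figures mean for a single reply, a back-of-the-envelope model is time to first token plus output tokens divided by throughput. The sketch below uses the ~2100 tok/sec figure above and the ~70ms first-token figure quoted earlier on this page as defaults; it is a simplification that ignores network round trips and queueing, and estimated_response_seconds is a hypothetical helper. Under these assumptions a 500-token reply takes roughly 0.31 s.

```python
def estimated_response_seconds(output_tokens: int,
                               tokens_per_second: float = 2100.0,
                               first_token_seconds: float = 0.070) -> float:
    """Rough reply time: time to first token plus generation time."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return first_token_seconds + output_tokens / tokens_per_second
```

Multiplying the result by the number of steps gives a rough lower bound for a sequential agent loop.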

Where is Cerebras hosted, and how is data handled?

Cerebras Inference Cloud runs in US datacenters (Stockton CA, Dallas, Pittsburgh) with sovereign deployments via partners (G42 in UAE). API data is not used to train models. Enterprise contracts include zero data retention, SOC 2 and HIPAA. Via VerticalAPI BYOK your Cerebras contract and data terms remain intact.

Limitations and trade-offs

  • Model catalog is narrow — selected open-weight models only, no frontier closed models.
  • Context windows often capped at 8K–32K on the public tier (full 128K is enterprise).
  • Limited geographic coverage outside the US — higher RTT for European apps.
  • No multimodal (vision, audio, video) — text generation only.
  • Quality ceiling is set by Llama and DeepSeek — below GPT-5 and Claude Opus on hard reasoning.
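
If the context cap applies to your tier, long chat histories need trimming before each request. A minimal sketch follows, assuming a crude estimate of about 4 characters per token (an approximation, not the model's real tokenizer) and messages with plain string content; estimate_tokens and trim_history are hypothetical helpers.

```python
import math


def estimate_tokens(text: str, chars_per_token: int = 4) -> int:
    """Crude token estimate; a real tokenizer will give different counts."""
    return math.ceil(len(text) / chars_per_token)


def trim_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the estimate fits the budget."""
    # System messages are kept and moved to the front of the result.
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def total(msgs: list[dict]) -> int:
        return sum(estimate_tokens(m["content"]) for m in msgs)

    # Always keep the most recent message, even if it alone exceeds the budget.
    while len(rest) > 1 and total(system + rest) > max_tokens:
        rest.pop(0)
    return system + rest
```

Leaving headroom below the cap for the model's reply is advisable, since the limit covers prompt and completion together on most OpenAI-compatible APIs.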

Where Cerebras is heading

  1. WSE-4 next-generation wafer expected with even higher token throughput.
  2. More open-weight models added (Llama 4, Mistral, Qwen 3) as they ship.
  3. Expanded context windows on public tiers as memory architecture scales.
  4. EU and APAC datacenter deployments through partnerships for sovereign workloads.

Related questions

  • Cerebras vs Groq — which open-weight inference provider is faster in 2026?
  • Is Llama 3.1 405B on Cerebras a real GPT-4o competitor?
  • Best provider for agentic loops that need 1000+ tokens/sec?
  • Can Cerebras host my fine-tuned Llama 70B?
  • Cerebras + VerticalAPI fallback routing — how does it work?