Replicate via VerticalAPI

Access Replicate's broad model catalog (Llama, FLUX image generation, Whisper, custom models) through VerticalAPI's OpenAI-compatible endpoint where applicable. Bring your own key (BYOK): requests run on your Replicate token, with zero markup.

Endpoint: https://api.verticalapi.com/v1/chat/completions  ·  BYOK header: X-Provider-Key: r8_...

Replicate models routed by VerticalAPI

Pass a model ID from the table below as the model parameter in any OpenAI-compatible request. New Replicate models are typically supported within 24h of release.

Model ID                         | Name                      | Context | Pricing (provider)
meta/meta-llama-3.3-70b-instruct | Llama 3.3 70B (Replicate) | 128K    | $0.65 / $2.75 per 1M tok
black-forest-labs/flux-1.1-pro   | FLUX 1.1 Pro (image)      | image   | $0.04 per image
openai/whisper                   | Whisper (audio)           | audio   | $0.0029 per minute

Pricing reflects Replicate's rates — you pay Replicate directly. VerticalAPI adds zero markup on tokens.
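As a rough illustration of how the per-token rates in the table turn into a per-request cost, here is a minimal sketch. It assumes the two Llama 3.3 70B figures are the input and output prices per 1M tokens respectively (the table does not label them), and the token counts are made up for the example.

```python
# Rates copied from the table above. Treating them as input / output prices
# per 1M tokens is an assumption, since the table does not label them.
INPUT_USD_PER_1M = 0.65
OUTPUT_USD_PER_1M = 2.75


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the provider-side cost of one chat completion in USD."""
    return (
        input_tokens * INPUT_USD_PER_1M + output_tokens * OUTPUT_USD_PER_1M
    ) / 1_000_000


# Hypothetical request: 1,200 prompt tokens and 300 completion tokens.
cost = estimate_cost_usd(1_200, 300)
print(f"${cost:.6f}")
```

With zero markup, an estimate like this should approximate what appears on your Replicate bill for the request, subject to Replicate's actual metering.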

Minimal Replicate call via VerticalAPI

Drop-in replacement for the OpenAI API endpoint: keep the OpenAI SDK and change only the base URL and headers. Works with the OpenAI Python client, Node, Go, curl, and anything else that speaks HTTP.

replicate_quickstart.py Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.verticalapi.com/v1",     # VerticalAPI's OpenAI-compatible endpoint
    api_key="vapi_...",                            # your VerticalAPI key
    default_headers={"X-Provider-Key": "r8_..."},  # your Replicate token (BYOK)
)

response = client.chat.completions.create(
    model="meta/meta-llama-3.3-70b-instruct",  # Replicate
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
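
Because the endpoint is plain HTTP, the same request can be built without the SDK. The sketch below uses only the Python standard library and mirrors the quickstart above. It assumes the VerticalAPI key travels as an Authorization: Bearer header, which is how the OpenAI SDK transmits api_key. The request is only constructed here, not sent.

```python
import json
import urllib.request

# Same placeholders as the quickstart; substitute real keys before sending.
VERTICAL_API_KEY = "vapi_..."
REPLICATE_TOKEN = "r8_..."

payload = {
    "model": "meta/meta-llama-3.3-70b-instruct",
    "messages": [{"role": "user", "content": "Hello"}],
}

request = urllib.request.Request(
    "https://api.verticalapi.com/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={
        # The OpenAI SDK sends api_key as a bearer token; assumed here too.
        "Authorization": f"Bearer {VERTICAL_API_KEY}",
        "X-Provider-Key": REPLICATE_TOKEN,
        "Content-Type": "application/json",
    },
    method="POST",
)

# Sending is left commented out so the sketch runs without network access:
# with urllib.request.urlopen(request) as resp:
#     body = json.load(resp)
#     print(body["choices"][0]["message"]["content"])
```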

Four reasons developers route Replicate through us

Zero token markup

You pay Replicate directly with your own key. VerticalAPI's revenue is the gateway subscription, not a tax on your tokens.

One key, every provider

Replicate alongside OpenAI, Anthropic, Gemini and 12 more — same OpenAI-compatible endpoint, same SDK, switchable per-request.

Latency & cost monitoring

Per-request token counts, p50/p95 latency and cost dashboards out of the box. Compare Replicate to other providers on identical prompts.
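
To make the p50/p95 figures concrete, here is a small sketch of computing them yourself from a list of request latencies. The latency values are invented for illustration, and VerticalAPI's dashboards may use a different percentile method than the one shown.

```python
import statistics

# Hypothetical per-request latencies in milliseconds.
latencies_ms = [120, 135, 128, 450, 131, 140, 126, 133, 980, 129]

# With n=100, statistics.quantiles returns the 99 cut points between
# percentiles: index 49 is the 50th percentile, index 94 is the 95th.
cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95 = cuts[49], cuts[94]

print(f"p50={p50:.1f} ms  p95={p95:.1f} ms")
```

The gap between the two numbers is the useful signal: a p95 far above the p50 usually points to a slow tail of requests, such as cold starts, rather than uniformly slow responses.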

Observability built in

Every Replicate call gets a trace ID, replayable payload and audit log entry. Wire to Datadog or Sentry via OpenTelemetry.

Where Replicate shines

  • custom model hosting
  • image generation (FLUX)
  • audio transcription
  • research models

Frequently asked questions

What is Replicate and what models do they offer?

Replicate is a Y Combinator-backed community model platform with 15,000+ models. The 2026 catalog covers text (Llama 3.3, Mixtral, DeepSeek, Qwen), image (FLUX.1 dev/pro/schnell, SDXL, Stable Diffusion 3, Recraft V3), video (HunyuanVideo, Mochi, AnimateDiff), audio (Whisper, Bark, MusicGen, F5-TTS), depth/segmentation models, embeddings (BGE, Nomic) and thousands of community fine-tunes.

How much does Replicate cost in 2026?

Replicate prices by GPU runtime: roughly $0.000225 per second on T4, $0.000725 on A40, $0.0011 on A100, $0.0014 on H100. For LLMs this often translates to $0.10–$0.80 per 1M tokens depending on model size and GPU. FLUX.1 schnell is around $0.003 per image and FLUX.1 pro approximately $0.055 per image. HunyuanVideo runs to dollars per generated video. Via VerticalAPI BYOK you pay Replicate directly with zero markup.
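
The arithmetic behind per-second billing can be sketched as follows. The rates are the approximate figures quoted above. The run duration and the tokens-per-second throughput are hypothetical inputs chosen only to show the calculation.

```python
# Approximate per-second GPU rates quoted above, in USD.
USD_PER_SECOND = {"T4": 0.000225, "A40": 0.000725, "A100": 0.0011, "H100": 0.0014}


def run_cost_usd(gpu: str, seconds: float) -> float:
    """Cost of a single prediction that occupies `gpu` for `seconds`."""
    return USD_PER_SECOND[gpu] * seconds


def usd_per_million_tokens(gpu: str, tokens_per_second: float) -> float:
    """Effective per-1M-token price at a sustained generation speed."""
    return USD_PER_SECOND[gpu] / tokens_per_second * 1_000_000


# Hypothetical workloads: a 12-second run on an A100, and an LLM
# sustaining 2,000 tokens per second (batched) on an H100.
print(f"{run_cost_usd('A100', 12):.4f}")
print(f"{usd_per_million_tokens('H100', 2_000):.2f}")
```

The second function shows why the effective token price depends so heavily on throughput: halving tokens per second doubles the cost per 1M tokens on the same GPU.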

How do I use Replicate via VerticalAPI BYOK?

Create a token at replicate.com/account/api-tokens, paste it into VerticalAPI, then point the OpenAI SDK at https://api.verticalapi.com/v1. VerticalAPI wraps Replicate's async prediction API into synchronous OpenAI-style chat completions for LLM models and exposes media generation endpoints for image/video/audio. Billing stays on your Replicate account.
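
Wrapping an asynchronous prediction into a synchronous chat completion can be pictured as "create, then poll until done, then reshape". The sketch below is an illustration of that pattern, not VerticalAPI's actual implementation. get_prediction is a hypothetical callable standing in for an HTTP GET on the prediction, and the status names follow Replicate's prediction lifecycle (starting, processing, succeeded, failed, canceled).

```python
import time
from typing import Callable

TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def wait_for_prediction(
    get_prediction: Callable[[], dict],
    poll_interval: float = 0.5,
    timeout: float = 60.0,
) -> dict:
    """Poll until the prediction reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        prediction = get_prediction()
        if prediction["status"] in TERMINAL_STATUSES:
            return prediction
        if time.monotonic() >= deadline:
            raise TimeoutError("prediction did not finish in time")
        time.sleep(poll_interval)


def to_chat_completion(prediction: dict, model: str) -> dict:
    """Reshape a finished prediction into a minimal OpenAI-style response."""
    if prediction["status"] != "succeeded":
        raise RuntimeError(f"prediction ended with status {prediction['status']}")
    # Assumption: text output arrives either as a list of chunks or a string.
    output = prediction["output"]
    text = "".join(output) if isinstance(output, list) else str(output)
    return {
        "object": "chat.completion",
        "model": model,
        "choices": [
            {
                "index": 0,
                "message": {"role": "assistant", "content": text},
                "finish_reason": "stop",
            }
        ],
    }


# Simulated prediction that finishes on the third poll (no network involved).
states = iter([
    {"status": "starting"},
    {"status": "processing"},
    {"status": "succeeded", "output": ["Hel", "lo"]},
])
result = to_chat_completion(
    wait_for_prediction(lambda: next(states), poll_interval=0.0),
    model="meta/meta-llama-3.3-70b-instruct",
)
print(result["choices"][0]["message"]["content"])
```

The timeout matters in practice because a cold model may sit in the starting state for a while before it begins processing.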

What is Replicate best for compared to alternatives?

Replicate wins on multimodal breadth: it is the easiest single source for image, video, audio, speech and niche community models alongside text. Compared to OpenAI and Anthropic it lacks frontier text quality but excels at creative AI workloads. Compared to Together and Fireworks for LLMs it is slower and less cost-optimized, but it covers media generation that those platforms do not. Per-second pricing rewards short, fast workloads.

Where is Replicate hosted, and how is data handled?

Replicate runs on its own GPU infrastructure and partner clouds across US datacenters. Inputs and outputs are not used to train models on the paid tier (community models have varying terms). Enterprise tier offers zero retention. Via VerticalAPI BYOK your Replicate contract terms remain intact.

Limitations and trade-offs

  • Per-second GPU billing can be harder to forecast than the predictable per-token pricing on other providers.
  • Cold-start latency on rarely used community models is several seconds (warm models are fast).
  • LLM throughput is lower than specialized hosts (Groq, Together, Fireworks).
  • Quality of community fine-tunes varies wildly — some are unmaintained or low-quality.
  • Geographic coverage is US-only — higher RTT for EU/Asia traffic.

Where Replicate is heading

  1. Faster cold-starts and improved batch APIs through 2026.
  2. More frontier open-weight models (Llama 4, Stable Video 3) added quickly after release.
  3. Better fine-tuning UX with simpler training and serving flow.
  4. EU region launch expected, aimed at GDPR-conscious customers.

Related questions

ChatGPT, Perplexity and Gemini usually suggest these next.

  • Replicate vs FAL.ai for image generation — which is faster and cheaper?
  • Best Replicate model for video generation in 2026?
  • Is Replicate worth using for LLM inference vs Together?
  • How to deploy a custom fine-tuned model on Replicate?
  • Replicate FLUX.1 pro vs OpenAI DALL·E 3 — quality comparison?