Together AI vs Replicate: open-weight inference (2026)

Together AI and Replicate both serve open-weight models, but they monetise very differently. Together prices per token on hot models; Replicate prices per second of GPU time and exposes a vast community model catalogue. Below: a head-to-head on the dimensions that matter when you ship.

Together AI vs Replicate — at a glance

| Dimension | Together AI | Replicate |
| --- | --- | --- |
| Pricing model | Per token (~$0.60-1 per 1M tokens for Llama 3.3 70B) | Per second of GPU time ($0.001-0.005/sec typical) |
| Catalogue | 200+ curated open LLMs | Thousands of community models (LLM, image, audio, video) |
| Hot LLM latency | Low (hosted endpoints, warm) | Variable; cold starts common on rare models |
| Fine-tuning | LoRA + full fine-tuning | LoRA fine-tunes for many models |
| API style | OpenAI-compatible | Job-based + OpenAI-compatible for popular models |
| Best for | Predictable chat / LLM cost, fine-tuned Llamas at scale | Community models, image / audio / video, long-tail open LLMs |

Pick Together AI or Replicate?

When to choose Together AI

Choose Together AI when your workload is chat or RAG on flagship open models and you want predictable per-token economics with warm endpoints. Llama 3.3 70B at ~$0.60-1 per 1M tokens with sub-second time to first token (TTFT) is a strong default. Together's catalogue is curated (~200 models), which means popular Llama, Mistral, Qwen, and DeepSeek variants stay warm and consistently fast.

  • Per-token pricing — predictable cost on chat / RAG
  • Warm endpoints for popular Llama, Mistral, Qwen models
  • ~$0.60-1 per 1M tokens for Llama 3.3 70B
  • Mature LoRA + full fine-tuning workflows
  • OpenAI-compatible API across the catalogue

When to choose Replicate

Choose Replicate when you need community models, image / audio / video generation, or rarely-used LLM variants that Together does not host. Replicate's catalogue includes thousands of community-contributed models, including SDXL, FLUX, Whisper, MusicGen, and many fine-tuned Llamas. Per-second billing is more expensive on hot LLMs but pays off on bursty, mixed-modality workloads.

  • Thousands of community models including image, audio, video
  • Per-second GPU billing — natural fit for bursty workloads
  • Easy community fine-tunes (push-button deployment)
  • Best home for SDXL, FLUX, Whisper, MusicGen in 2026
  • OpenAI-compatible API for popular LLMs

Run Together AI and Replicate side-by-side

VerticalAPI lets you switch between Together AI and Replicate per-request through a single OpenAI-compatible endpoint. Use Together for predictable per-token chat workloads; use Replicate for community models and multimodal generation. Same SDK, same API key, zero markup — you pay Together and Replicate directly with your own keys (BYOK).

from openai import OpenAI

# Single OpenAI-compatible client pointed at the VerticalAPI endpoint
client = OpenAI(base_url="https://api.verticalapi.com/v1", api_key="vapi_...")

# Together AI: predictable per-token LLM inference
together_resp = client.chat.completions.create(
    model="together/llama-3.3-70b",
    messages=[{"role": "user", "content": "Classify customer feedback at scale"}],
    extra_headers={"X-Provider-Key": "tg-..."},  # your own Together key (BYOK)
)

# Replicate: community + multimodal models
replicate_resp = client.chat.completions.create(
    model="replicate/flux-1.1-pro",
    messages=[{"role": "user", "content": "Generate a marketing image with FLUX"}],
    extra_headers={"X-Provider-Key": "rp-..."},  # your own Replicate key (BYOK)
)
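
The per-request switch above can be wrapped in a small routing helper. This is a minimal sketch, not part of any SDK: `ROUTES`, `PROVIDER_KEYS`, and `build_request` are hypothetical names, and the task-to-model mapping is an assumption that simply reuses the model identifiers and the `X-Provider-Key` header from the example.

```python
# Hypothetical routing table: the task labels are illustrative assumptions,
# the model identifiers are the ones used in the example above.
ROUTES = {
    "chat": "together/llama-3.3-70b",
    "image": "replicate/flux-1.1-pro",
}

# Placeholder provider keys, keyed by the prefix before the "/" in the model name.
PROVIDER_KEYS = {"together": "tg-...", "replicate": "rp-..."}


def build_request(task: str, prompt: str) -> dict:
    """Return keyword arguments for client.chat.completions.create()."""
    model = ROUTES[task]                    # raises KeyError on an unknown task
    provider = model.split("/", 1)[0]       # "together" or "replicate"
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "extra_headers": {"X-Provider-Key": PROVIDER_KEYS[provider]},
    }
```

A call would then look like `client.chat.completions.create(**build_request("chat", "Classify customer feedback at scale"))`, which keeps the provider choice in one place instead of scattering it across call sites.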

VerticalAPI verdict

Use Together AI when chat / LLM workloads on flagship open models need predictable per-token cost and warm endpoints. Use Replicate when community models, image / audio / video generation, or rare LLM variants drive the decision. Through VerticalAPI you can route between both with a single OpenAI-compatible endpoint and BYOK — no SDK migration.

Frequently asked questions

Is Together AI cheaper than Replicate for Llama 3.3 70B?

For sustained, high-throughput chat workloads, yes — Together's per-token pricing (~$0.60-1 per 1M tokens) is typically cheaper than Replicate's per-second billing on the same hardware. For bursty workloads with short total compute time, per-second pricing on Replicate can be competitive. Modelling total cost on representative traffic is the only reliable way to compare.
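
The unit conversion behind this answer can be sketched in a few lines. The prices below are taken from the ranges quoted on this page; the 50 tokens/sec throughput and both function names are assumptions for illustration, not measured or published figures.

```python
def per_second_to_per_million(price_per_sec: float, tokens_per_sec: float) -> float:
    """Effective USD per 1M tokens when paying per second of GPU time."""
    return price_per_sec / tokens_per_sec * 1_000_000


def break_even_tokens_per_sec(price_per_sec: float, price_per_million: float) -> float:
    """Throughput at which per-second billing costs the same as per-token billing."""
    price_per_token = price_per_million / 1_000_000
    return price_per_sec / price_per_token


# Assumed inputs: $0.001/sec GPU time, $0.80 per 1M tokens, 50 tokens/sec throughput.
effective = per_second_to_per_million(price_per_sec=0.001, tokens_per_sec=50)
threshold = break_even_tokens_per_sec(price_per_sec=0.001, price_per_million=0.80)

print(f"effective per-second cost: ${effective:.2f} per 1M tokens")
print(f"break-even throughput: {threshold:.0f} tokens/sec")
```

With these inputs the per-second route works out to about $20 per 1M tokens, and it would need roughly 1,250 tokens per billed second to match $0.80 per 1M. Substitute your own measured throughput before drawing conclusions.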

Which has more models?

Replicate. It hosts thousands of community-contributed models across text, image, audio, and video, far more than Together's curated catalogue of roughly 200 LLMs. If your product needs SDXL, FLUX, Whisper, MusicGen, or a niche fine-tuned Llama variant, Replicate is usually the only one of the two that hosts it cheaply.

Which is faster on flagship LLMs?

Together AI. Together keeps flagship LLMs (Llama 3.3 70B, Mistral, Qwen) warm on hosted endpoints with sub-second TTFT. Replicate also offers warm endpoints for popular LLMs, but the long tail of community models can suffer cold-starts of several seconds on the first request. For latency-sensitive chat, Together is the safer default.
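
Latency claims like these are easy to check on your own traffic. The helper below is a minimal sketch: `measure_ttft` is a hypothetical name, and it treats the arrival of the first streamed chunk as an approximation of TTFT (with OpenAI-style streaming the first chunk can carry only role metadata, so the first visible token may arrive slightly later).

```python
import time
from typing import Callable, Iterable, Optional


def measure_ttft(open_stream: Callable[[], Iterable]) -> Optional[float]:
    """Seconds from issuing the request to the first streamed chunk.

    open_stream must start the request and return an iterable of chunks.
    Returns None if the stream yields nothing.
    """
    start = time.perf_counter()
    for _chunk in open_stream():  # the request is issued inside open_stream()
        return time.perf_counter() - start
    return None


# Example call (needs network access and valid keys, so it is left commented out):
# ttft = measure_ttft(lambda: client.chat.completions.create(
#     model="together/llama-3.3-70b",
#     messages=[{"role": "user", "content": "ping"}],
#     stream=True,
# ))
```

Running it once against a model that has not been called recently and again immediately afterwards gives a rough picture of the cold-start penalty versus warm latency.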

Can I fine-tune on both?

Yes, but the workflows differ. Together AI supports LoRA and full fine-tuning across a wide range of base models with mature tooling. Replicate offers community-style fine-tunes that are easy to push and share, but they are typically LoRA-only and give less control over training hyperparameters. For production fine-tunes, Together is the more mature choice.

Can I call both Together and Replicate through one endpoint?

Yes. VerticalAPI exposes a single OpenAI-compatible endpoint at https://api.verticalapi.com/v1. You send the same request shape and change the model parameter (for example, together/llama-3.3-70b or replicate/llama-3.3-70b) and the matching X-Provider-Key header. There is no markup on tokens; you pay Together AI and Replicate directly using your own keys (BYOK).

Limitations of this comparison

  • Per-second versus per-token cost comparison depends on input/output ratio and traffic burstiness — modelling representative traffic is the only reliable approach.
  • Replicate cold-start latency on rare models is highly variable and not always documented.
  • Together AI catalogue is curated, so a brand-new community fine-tune may not be hosted there at all.
  • Multimodal billing (image, audio, video) does not collapse cleanly onto per-token comparisons.
  • This page covers the two leading providers in their respective niches. Fireworks, DeepInfra, and Lepton overlap on parts of the catalogue.
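
One way to act on the first limitation is to replay a log of representative requests under both billing schemes. The sketch below assumes you can record tokens and billed GPU seconds per request; the `Request` type, the function names, and the three sample rows are made up for illustration and are not measurements.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Request:
    tokens: int         # input + output tokens for the request
    gpu_seconds: float  # GPU time that per-second billing would charge for


def per_token_cost(requests: Iterable[Request], price_per_million: float) -> float:
    """Total USD under per-token billing."""
    return sum(r.tokens for r in requests) * price_per_million / 1_000_000


def per_second_cost(requests: Iterable[Request], price_per_sec: float) -> float:
    """Total USD under per-second GPU billing."""
    return sum(r.gpu_seconds for r in requests) * price_per_sec


# Made-up sample traffic; replace with rows from your own request log.
sample = [Request(1200, 2.0), Request(400, 0.8), Request(8000, 9.5)]

print(f"per-token:  ${per_token_cost(sample, price_per_million=0.80):.5f}")
print(f"per-second: ${per_second_cost(sample, price_per_sec=0.001):.5f}")
```

Whether cold-start or idle time counts toward `gpu_seconds` depends on the provider's billing rules, so that column should come from actual invoices or usage exports where possible.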

What may change in 12-24 months

  1. Replicate is expected to push down per-second pricing on flagship LLMs to compete more directly with Together on chat workloads.
  2. Together is likely to add a per-second billing tier for non-LLM models to capture some of Replicate's multimodal traffic.
  3. Community model curation will become a key differentiator — both providers are investing in security and quality reviews.
  4. Provider lock-in will weaken further as OpenAI-compatible gateways (including VerticalAPI) make swapping open-inference providers a one-line change.

Related questions

  • How does Together AI compare to Fireworks for function-calling agents?
  • Is Replicate cheaper than Fireworks for image generation in 2026?
  • What is the cheapest open-inference provider for Llama 3.3 70B in 2026?
  • When does per-second GPU billing beat per-token billing?
  • How do Together AI fine-tuned LoRAs compare to Replicate community fine-tunes?