Build a RAG application via VerticalAPI
RAG (retrieval-augmented generation) is one of the most common production LLM patterns in 2026: it combines a vector index, a reranker and a generation model. VerticalAPI lets you mix and match the cheapest embedding provider, the strongest reranker, and the best generation model, all behind one endpoint with one BYOK key per provider.
Best models for this use case
Cohere Embed v3 + Rerank 3.5
Best-in-class embedding + reranker combo, RAG-tuned (embedding at $0.10 / 1M tokens)
Claude Sonnet 4.5 (long-context)
200K context lets you skip aggressive chunking; native citations
Gemini 2.5 Pro (2M context)
Skip retrieval entirely for medium-size corpora by placing the whole corpus in the context window
How it fits together
Indexing job: docs → chunker → Cohere Embed v3 → pgvector/Pinecone.
Query path: question → embed → vector search → Cohere rerank → top-K chunks → Claude / GPT generation with citations.
Add a Redis cache for system prompts to make repeated calls cheap.
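The indexing half of this pipeline can be sketched without any provider calls. This is a minimal illustration under assumptions, not VerticalAPI's implementation: `chunk_text`, `index_documents`, the word-based chunk size and the plain list standing in for pgvector/Pinecone are all hypothetical, and `embed` is whatever callable wraps your embedding provider.

```python
from typing import Callable, Sequence


def chunk_text(text: str, size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping windows of at most `size` words."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks: list[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def index_documents(
    docs: dict[str, str],
    embed: Callable[[str], Sequence[float]],
    store: list[dict],
) -> None:
    """Chunk each document, embed each chunk, and append rows to `store`.

    `store` is an in-memory stand-in (assumption) for a real vector database.
    """
    for doc_id, text in docs.items():
        for i, chunk in enumerate(chunk_text(text)):
            store.append(
                {"doc_id": doc_id, "chunk": i, "text": chunk, "embedding": embed(chunk)}
            )
```

The overlap keeps a sentence that straddles a chunk boundary retrievable from either side; the right size and overlap depend on your documents and should be tuned against retrieval quality.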
Working example in Python
from openai import OpenAI

# Placeholders you supply: user_question (str), vector_db (your vector store
# client) and rerank() (a helper that calls the reranker and returns the
# chunks ordered by relevance).
client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",
    default_headers={"X-Provider-Key": "sk-ant-..."},
)

# 1. Embed query (Cohere via VerticalAPI)
query_emb = client.embeddings.create(
    model="embed-english-v3",
    input=user_question,
).data[0].embedding

# 2. Vector search (your DB: pgvector, Pinecone, ...)
chunks = vector_db.search(query_emb, k=20)

# 3. Rerank (Cohere) and pick top 5
top_chunks = rerank(chunks, user_question)[:5]

# 4. Generate with citations
response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{
        "role": "user",
        "content": f"Answer using these sources:\n\n{top_chunks}\n\nQ: {user_question}",
    }],
)
print(response.choices[0].message.content)
Typical cost at production volume
A RAG app serving 50K queries/month with 5 reranked chunks of ~1K tokens each costs roughly $30-90/month: ~$5 embedding (Cohere), ~$10 rerank, $15-75 generation depending on model. Prompt caching on Claude can cut generation cost by 60-80%.
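To sanity-check an estimate like this against your own traffic, the token volume can be worked out directly. The following is a rough sketch: the function names are illustrative, and the per-token price and cache savings are parameters you supply from your provider's current price list and your own measurements, not figures asserted here.

```python
def monthly_context_tokens(queries: int, chunks_per_query: int, tokens_per_chunk: int) -> int:
    """Tokens of retrieved context sent to the generation model per month."""
    return queries * chunks_per_query * tokens_per_chunk


def monthly_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of `tokens` at a given price per one million tokens."""
    return tokens / 1_000_000 * usd_per_million_tokens


def after_cache_savings(cost_usd: float, savings_fraction: float) -> float:
    """Apply a fractional saving (e.g. 0.6 for 60%) from prompt caching."""
    if not 0.0 <= savings_fraction <= 1.0:
        raise ValueError("savings_fraction must be between 0 and 1")
    return cost_usd * (1.0 - savings_fraction)


# The scenario from the text: 50K queries, 5 chunks of ~1K tokens each.
tokens = monthly_context_tokens(50_000, 5, 1_000)
print(tokens)  # 250000000
```

At this volume the retrieved context alone is 250M input tokens per month, so the generation line is driven mostly by the input price of the model you choose and by how much of that context is served from cache.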
Common questions
Should I use long-context instead of RAG?
If your corpus fits in 200K-2M tokens and queries hit most of it, long-context (Claude Sonnet 4.5 or Gemini 2.5 Pro 2M) often beats RAG on quality and is simpler. RAG wins when the corpus is large, queries are narrow, or you need provenance / citations.
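A first pass at this decision can be automated by estimating the corpus size in tokens. The sketch below is only a heuristic: the 4-characters-per-token ratio is a rough rule of thumb for English text (use your model's tokenizer for a real count), and `estimate_tokens`, `fits_in_context` and the reserved-token margin are illustrative names and values.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate from character count (assumed ratio)."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_context(docs: list[str], context_window: int, reserved: int = 8_000) -> bool:
    """True if the whole corpus fits in the window, leaving `reserved`
    tokens for the system prompt, the question and the answer."""
    total = sum(estimate_tokens(d) for d in docs)
    return total <= context_window - reserved
```

Passing this check is necessary rather than sufficient: as noted above, long-context pays off mainly when queries touch most of the corpus.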
Can I use Cohere embeddings with Claude generation?
Yes, and it is a strong combination. VerticalAPI handles both providers behind one endpoint: Cohere for embed/rerank, Claude for generation. Add prompt caching to make repeated context cheap.
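One way to keep the two providers swappable is to pass each stage in as a plain callable. This is a sketch under assumptions: `answer_question`, `build_prompt` and the stage signatures are hypothetical and not part of VerticalAPI or any SDK; in practice each callable would wrap the corresponding call from the working example.

```python
from typing import Callable, Sequence


def build_prompt(chunks: Sequence[str], question: str) -> str:
    """Number each source so the model can cite it as [1], [2], ..."""
    sources = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using these sources and cite them by number:\n\n"
        f"{sources}\n\nQ: {question}"
    )


def answer_question(
    question: str,
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], Sequence[str]],
    rerank: Callable[[Sequence[str], str], Sequence[str]],
    generate: Callable[[str], str],
    k: int = 20,
    top_n: int = 5,
) -> str:
    """Run embed -> vector search -> rerank -> generate with injected stages."""
    query_emb = embed(question)
    candidates = search(query_emb, k)
    top_chunks = list(rerank(candidates, question))[:top_n]
    return generate(build_prompt(top_chunks, question))
```

Because the stages are injected, swapping the embedding or generation model is a change to one callable, and the whole path can be exercised in tests with stubs before any provider key is configured.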
What's the cheapest embedding option?
Options include Cohere Embed v3 ($0.10 / 1M tokens), text-embedding-3-small (OpenAI, $0.02 / 1M tokens), and BGE on DeepInfra (typically <$0.05). Pick based on retrieval quality on your eval set, not raw price.
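Comparing embedding models on your own eval set can be as simple as measuring recall@k over labelled queries. A minimal sketch follows; the data shapes and the names `recall_at_k` and `mean_recall_at_k` are assumptions for illustration, and `retrieved` would come from running each candidate model through your vector search.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved_ids, relevant_ids) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
```

Running the same labelled queries through each candidate model and comparing the mean gives a like-for-like quality number to weigh against the price difference.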