Build a RAG application via VerticalAPI

RAG (retrieval-augmented generation) is the most common production LLM pattern in 2026 — combining a vector index, a reranker and a generation model. VerticalAPI lets you mix and match the cheapest embedding provider, the strongest reranker, and the best generation model — all behind one endpoint and one BYOK key per provider.

How it fits together

Indexing job: docs → chunker → Cohere Embed v3 → pgvector/Pinecone. Query path: question → embed → vector search → Cohere rerank → top-K chunks → Claude / GPT generation with citations. Add Redis cache on system prompts for cheap repeated calls.

Working example in python

rag.pythonPython
from openai import OpenAI

client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",
    default_headers={"X-Provider-Key": "sk-ant-..."}
)

# 1. Embed query (Cohere via VerticalAPI)
query_emb = client.embeddings.create(
    model="embed-english-v3",
    input=user_question
).data[0].embedding

# 2. Vector search (your DB — pgvector, Pinecone, ...)
chunks = vector_db.search(query_emb, k=20)

# 3. Rerank (Cohere) and pick top 5
top_chunks = rerank(chunks, user_question)[:5]

# 4. Generate with citations
response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{
        "role": "user",
        "content": f"Answer using these sources:\n\n{top_chunks}\n\nQ: {user_question}"
    }]
)

Typical cost at production volume

A RAG app serving 50K queries/month with 5 reranked chunks of ~1K tokens each costs roughly $30-90/month: ~$5 embedding (Cohere), ~$10 rerank, $15-75 generation depending on model. Prompt caching on Claude can cut generation cost by 60-80%.

See VerticalAPI plan pricing →

Common questions

Should I use long-context instead of RAG?

If your corpus fits in 200K-2M tokens and queries hit most of it, long-context (Claude Sonnet 4.5 or Gemini 2.5 Pro 2M) often beats RAG on quality and is simpler. RAG wins when corpus is large, queries are narrow, or you need provenance / citations.

Can I use Cohere embeddings with Claude generation?

Yes — that's the killer combo. VerticalAPI handles both providers behind one endpoint: Cohere for embed/rerank, Claude for generation. Add prompt caching to make repeated context cheap.

What's the cheapest embedding option?

Cohere Embed v3 ($0.10 / 1M tokens), text-embedding-3-small (OpenAI, $0.02 / 1M), or BGE on DeepInfra (typically <$0.05). Pick based on retrieval quality on your eval set, not raw price.