Build a RAG application via VerticalAPI

Recommended providers

Best models for this use case

Cohere Embed v3 + Rerank 3.5

Best-in-class embedding + reranker combo, RAG-tuned ($0.10 / 1M tok embed)

View Cohere Embed v3 + Rerank 3.5 integration →

Claude Sonnet 4.5 (long-context)

200K context lets you skip aggressive chunking; native citations

View Claude Sonnet 4.5 (long-context) integration →

Gemini 2.5 Pro (2M context)

Eliminate retrieval entirely for medium-size corpora — just stuff context

View Gemini 2.5 Pro (2M context) integration →

Architecture

How it fits together

Indexing job: docs → chunker → Cohere Embed v3 → pgvector/Pinecone. Query path: question → embed → vector search → Cohere rerank → top-K chunks → Claude / GPT generation with citations. Add Redis cache on system prompts for cheap repeated calls.

Code example

Working example in python

rag.pythonPython
from openai import OpenAI

client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",
    default_headers={"X-Provider-Key": "sk-ant-..."}
)

# 1. Embed query (Cohere via VerticalAPI)
query_emb = client.embeddings.create(
    model="embed-english-v3",
    input=user_question
).data[0].embedding

# 2. Vector search (your DB — pgvector, Pinecone, ...)
chunks = vector_db.search(query_emb, k=20)

# 3. Rerank (Cohere) and pick top 5
top_chunks = rerank(chunks, user_question)[:5]

# 4. Generate with citations
response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{
        "role": "user",
        "content": f"Answer using these sources:\n\n{top_chunks}\n\nQ: {user_question}"
    }]
)

Pricing estimate

Typical cost at production volume

A RAG app serving 50K queries/month with 5 reranked chunks of ~1K tokens each costs roughly $30-90/month: ~$5 embedding (Cohere), ~$10 rerank, $15-75 generation depending on model. Prompt caching on Claude can cut generation cost by 60-80%.

See VerticalAPI plan pricing →

FAQ

Common questions

Should I use long-context instead of RAG?

If your corpus fits in 200K-2M tokens and queries hit most of it, long-context (Claude Sonnet 4.5 or Gemini 2.5 Pro 2M) often beats RAG on quality and is simpler. RAG wins when corpus is large, queries are narrow, or you need provenance / citations.

Can I use Cohere embeddings with Claude generation?

Yes — that's the killer combo. VerticalAPI handles both providers behind one endpoint: Cohere for embed/rerank, Claude for generation. Add prompt caching to make repeated context cheap.

What's the cheapest embedding option?

Cohere Embed v3 ($0.10 / 1M tokens), text-embedding-3-small (OpenAI, $0.02 / 1M), or BGE on DeepInfra (typically <$0.05). Pick based on retrieval quality on your eval set, not raw price.

More guides

Other use cases

Build a chatbot via VerticalAPI

I want to build a customer-facing chatbot

Read guide →

Build autonomous agents via VerticalAPI

I want to build an agentic / autonomous LLM workflow

Read guide →

Function calling and tool use across providers

I want consistent tool-calling behavior across multiple LLM providers

Read guide →

Vision + text via VerticalAPI

I want to send images, audio or video to an LLM

Read guide →