Build a chatbot via VerticalAPI
Building a production chatbot in 2026 means picking the right model for tone, latency and cost — and switching between providers as your traffic patterns evolve. VerticalAPI gives you one OpenAI-compatible endpoint that you can repoint at GPT-4o, Claude Haiku, Gemini Flash or open-weights Llama without changing your code. Here's the recommended setup.
Best models for this use case
Claude Haiku 4.5
Cheapest agentic chat model with strong tool use ($1 / $5 per 1M input / output tokens)
View Claude Haiku 4.5 integration →
Gemini 2.5 Flash
Generous free tier, fast time to first token (TTFT), and multimodal input if you need it ($0.30 / $2.50 per 1M input / output tokens)
View Gemini 2.5 Flash integration →
Llama 3.3 70B (Groq)
Sub-100ms latency for typing-feel UX; cheapest open-weights chat
View Llama 3.3 70B (Groq) integration →
How it fits together
Frontend (React/Next) → POST /api/chat → your backend → VerticalAPI /v1/chat/completions (streaming) → token-streamed response back to frontend. Add a Redis cache for system prompts and a per-user rate limit at your edge.
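The per-user rate limit mentioned above could be sketched as a token bucket. This is a minimal in-memory illustration, not part of VerticalAPI: the class name `PerUserRateLimiter` and its parameters are hypothetical, and a real edge deployment would likely keep the buckets in a shared store such as Redis rather than in one process.

```python
import time


class PerUserRateLimiter:
    """Token bucket per user: `capacity` burst size, refilled continuously."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock  # injectable so the limiter can be tested
        self._buckets = {}  # user_id -> (tokens_left, last_seen_time)

    def allow(self, user_id):
        """Return True and spend one token if the user may send a request."""
        now = self.clock()
        tokens, last = self._buckets.get(user_id, (self.capacity, now))
        # Refill for the time elapsed since the last request, up to capacity.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_second)
        if tokens >= 1:
            self._buckets[user_id] = (tokens - 1, now)
            return True
        self._buckets[user_id] = (tokens, now)
        return False
```

In the `/api/chat` handler you would call `limiter.allow(user_id)` before forwarding the request and return HTTP 429 when it is False.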
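Working example in Python
from openai import OpenAI

# One client for every provider: VerticalAPI exposes an OpenAI-compatible endpoint.
client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",  # VerticalAPI key
    default_headers={"X-Provider-Key": "sk-ant-..."},  # upstream provider key
)

def stream_reply(user_message):
    """Yield the assistant's reply piece by piece as it streams in."""
    response = client.chat.completions.create(
        model="claude-haiku-4-5",
        messages=[
            {"role": "system", "content": "You are Acme's friendly support bot."},
            {"role": "user", "content": user_message},
        ],
        stream=True,
    )
    for chunk in response:
        # Some chunks carry no choices or an empty delta; skip those.
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
Typical cost at production volume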
A typical chatbot serving 100K conversations/month at ~10 turns each (~500 tokens per turn) costs roughly $50-150/month on Claude Haiku 4.5, $20-60 on Gemini Flash, or $30-80 on Llama 3.3 70B via Groq.
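To check an estimate like this against your own traffic, the arithmetic can be sketched as below. The function name and the sample traffic figures are hypothetical; the prices are the Gemini 2.5 Flash list prices quoted earlier. The result is sensitive to the input/output split and to how much conversation history is re-sent on each turn, so plug in token counts measured from your own logs.

```python
def monthly_cost_usd(conversations, turns_per_conversation,
                     input_tokens_per_turn, output_tokens_per_turn,
                     input_price_per_m, output_price_per_m):
    """Estimate monthly spend from per-turn token counts and per-1M-token prices."""
    turns = conversations * turns_per_conversation
    input_cost = turns * input_tokens_per_turn / 1_000_000 * input_price_per_m
    output_cost = turns * output_tokens_per_turn / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# Hypothetical sample: 1,000 conversations, 10 turns each,
# 400 input + 100 output tokens per turn, at $0.30 / $2.50 per 1M tokens.
sample = monthly_cost_usd(1_000, 10, 400, 100, 0.30, 2.50)
print(f"${sample:.2f}")  # 4M input tokens ($1.20) + 1M output tokens ($2.50) = $3.70
```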
Common questions
Should I stream responses?
Yes — streaming reduces perceived latency dramatically. VerticalAPI passes streaming through unchanged; the OpenAI SDK's stream=True works on every supported provider.
How do I switch model based on user tier?
Pass model='claude-sonnet-4-5' for paid users and model='claude-haiku-4-5' for free users. Same endpoint, same auth, just a different model field.
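One way to keep that choice in a single place is a small lookup, sketched here. The tier names "paid" and "free" and the helper `model_for` are assumptions for illustration; the model IDs are the ones named above.

```python
# Map each user tier to the model ID sent in the request's `model` field.
MODEL_BY_TIER = {
    "paid": "claude-sonnet-4-5",
    "free": "claude-haiku-4-5",
}


def model_for(tier):
    """Return the model for a tier, falling back to the free-tier model."""
    return MODEL_BY_TIER.get(tier, MODEL_BY_TIER["free"])
```

The chat call then takes `model=model_for(user_tier)` and is otherwise unchanged.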
What about tool calling / function calling?
The standard OpenAI tools[] array works with Claude, GPT-4o, Gemini, Mistral and Llama (via Together/Groq/Fireworks). VerticalAPI normalizes provider differences so your tool definitions are portable.
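As a rough illustration of what a portable tool definition looks like, here is a sketch in the OpenAI tools[] format with a small dispatcher. The `get_order_status` tool, its canned return value, and the `run_tool_call` helper are hypothetical; only the shape of the `tools` entries follows the OpenAI function-calling format.

```python
import json


def get_order_status(order_id):
    # Hypothetical stand-in for a real order lookup.
    return {"order_id": order_id, "status": "shipped"}


# Tool definitions in the OpenAI tools[] format (JSON Schema parameters).
TOOLS = [{
    "type": "function",
    "function": {
        "name": "get_order_status",
        "description": "Look up the shipping status of an order.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
}]

HANDLERS = {"get_order_status": get_order_status}


def run_tool_call(name, arguments_json):
    """Execute one tool call; arguments arrive as a JSON string."""
    handler = HANDLERS.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(handler(**json.loads(arguments_json)))
```

With the OpenAI SDK you would pass `tools=TOOLS` to `client.chat.completions.create`, read each call's `function.name` and `function.arguments` from `message.tool_calls`, and send the result back as a message with role "tool" and the matching `tool_call_id`.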
Other use cases
I want to build retrieval-augmented generation (RAG)
I want to build an agentic / autonomous LLM workflow
I want consistent tool-calling behavior across multiple LLM providers
I want to send images, audio or video to an LLM