Vision + text via VerticalAPI

Recommended providers

Best models for this use case

GPT-4o

Strong general vision; image+text+audio in one model

View GPT-4o integration →

Gemini 2.5 Pro

Native video input (long clips, resolution-dependent); 1M-token context for long PDFs

View Gemini 2.5 Pro integration →

Claude Sonnet 4.5

Best for screenshot + UI reasoning, document QA on PDFs

View Claude Sonnet 4.5 integration →

Architecture

How it fits together

Client uploads image → your backend (consider downscaling to ~1024px to save input tokens) → VerticalAPI /v1/chat/completions with an image_url content part → provider returns the answer. For video on Gemini, pass frames via inline_data or a hosted URL; VerticalAPI handles either.
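The downscaling step in the flow above can be sketched as a small helper that works out the target size before any resize happens. This is a minimal sketch, assuming the ~1024px figure refers to the longer edge. `fit_within` and `MAX_EDGE` are hypothetical names, and the pixel resize itself is left to whatever imaging library your backend already uses.

```python
MAX_EDGE = 1024  # assumed target for the longer edge, taken from the ~1024px hint above


def fit_within(width: int, height: int, max_edge: int = MAX_EDGE) -> tuple[int, int]:
    """Return (width, height) scaled down so the longer edge is at most max_edge.

    Aspect ratio is preserved, and images that are already small enough are
    returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_edge:
        return width, height
    scale = max_edge / longest
    # Clamp to 1px so extreme aspect ratios never produce a zero-sized edge.
    return max(1, round(width * scale)), max(1, round(height * scale))


if __name__ == "__main__":
    # A 12-megapixel phone photo shrinks to a 1024x768 thumbnail.
    print(fit_within(4032, 3024))
```

Computing the size separately keeps the policy (how small is small enough) independent of the resize implementation, so the same limit can be reused for uploads from different clients.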

Code example

Working example in Python

multimodal.py
from openai import OpenAI
import base64

client = OpenAI(base_url="https://api.verticalapi.com/v1", api_key="vapi_...")

# 1. As HTTPS URL
response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
        ]
    }]
)

# 2. As base64 (works on every provider)
with open("chart.png", "rb") as f:  # context manager closes the file handle
    img_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the data table."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }]
)
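The base64 variant above hardcodes `image/png` in the data URL. For uploads of mixed types, a helper along these lines could pick the MIME type from the filename instead. It is a sketch using only the standard library, and `data_url_from_bytes` and `data_url_from_file` are hypothetical names, not part of any SDK.

```python
import base64
import mimetypes


def data_url_from_bytes(data: bytes, filename: str) -> str:
    """Build a data: URL for image bytes, guessing the MIME type from the filename."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def data_url_from_file(path: str) -> str:
    """Read an image file from disk and return it as a data: URL."""
    with open(path, "rb") as f:
        return data_url_from_bytes(f.read(), path)
```

The returned string can be used directly as the `url` value of an `image_url` part, in place of the f-string in the example above.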

Pricing estimate

Typical cost at production volume

Each input image is typically billed as roughly 85 to 1,500 tokens, depending on resolution. A vision-heavy app processing 10K images/month at GPT-4o pricing pays roughly $5-30/month for the image tokens alone. Gemini Flash is about 5x cheaper for the same workload.
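The estimate above is plain multiplication, which the sketch below makes explicit. The per-token price is an assumption for illustration ($2.50 per million input tokens). Check current provider pricing before relying on it. `monthly_image_cost` is a hypothetical helper name.

```python
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 2.50  # assumption, not taken from this page


def monthly_image_cost(images_per_month: int,
                       tokens_per_image: int,
                       usd_per_million_tokens: float = ASSUMED_USD_PER_MILLION_INPUT_TOKENS) -> float:
    """Cost in USD of the image input tokens only (prompt text and output excluded)."""
    total_tokens = images_per_month * tokens_per_image
    return total_tokens * usd_per_million_tokens / 1_000_000


if __name__ == "__main__":
    # Bounds of the per-image token range quoted above, at 10K images/month.
    for tokens in (85, 1500):
        print(tokens, monthly_image_cost(10_000, tokens))
```

Under the assumed price, the quoted $5-30/month corresponds to roughly 200 to 1,200 tokens per image, which sits inside the 85 to 1,500 range.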

See VerticalAPI plan pricing →

FAQ

Common questions

Can I send video to Gemini through VerticalAPI?

Yes. Gemini 2.5 Pro accepts video as input — pass either a Files API URL or base64 frames. VerticalAPI translates from OpenAI's vision message shape to Gemini's video parts.
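For the base64-frames route, one possible shape is to sample a handful of frames and send each as an ordinary `image_url` part. This is a sketch under assumptions. The page does not specify the exact message shape VerticalAPI expects for video, the model id `gemini-2.5-pro` is a guess, and `sample_indices` and `frames_request` are hypothetical helpers. Frame extraction (decoding the video into JPEG bytes) is not shown.

```python
import base64


def sample_indices(total_frames: int, count: int) -> list[int]:
    """Pick up to `count` evenly spaced frame indices from a clip."""
    if total_frames <= 0 or count <= 0:
        return []
    count = min(count, total_frames)
    return [i * total_frames // count for i in range(count)]


def frames_request(prompt: str, frames: list[bytes], model: str = "gemini-2.5-pro") -> dict:
    """Build chat-completions keyword arguments with one image_url part per JPEG frame.

    Assumption: the gateway accepts sampled frames as regular image parts.
    """
    parts = [{"type": "text", "text": prompt}]
    for jpeg in frames:
        b64 = base64.b64encode(jpeg).decode("ascii")
        parts.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    return {"model": model, "messages": [{"role": "user", "content": parts}]}
```

With an OpenAI-compatible client, the result could be passed as `client.chat.completions.create(**frames_request(prompt, frames))`. Fewer sampled frames means fewer input tokens, at the cost of temporal detail.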

What about audio in/out (voice agents)?

GPT-4o audio mode and Whisper (via Groq, OpenAI, Replicate) are routable. For full-duplex voice agents, pair Whisper transcription with Claude/GPT chat and an output TTS such as ElevenLabs (outside VerticalAPI's scope).

Do open-weights models do vision too?

Yes. Llama 3.2 90B Vision (via Together, Bedrock, Groq) handles images. Quality is below GPT-4o / Claude / Gemini, but it is cheaper at volume.
