Vision + text via VerticalAPI

Multimodal LLMs (vision + text, audio in/out, video in) are now mainstream, but the input shape varies widely across providers. VerticalAPI accepts the OpenAI vision format (an image_url content part carrying either a data: URI or an HTTPS URL) and translates it to each provider's native shape: Gemini's inline_data, Claude's image content blocks, Llama 3.2 Vision's parts.
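To make the translation concrete, here is a minimal sketch of how an OpenAI-style image_url part carrying a data: URI could be mapped onto Claude's image content block and Gemini's inline_data part. The function names (parse_data_url, to_claude_block, to_gemini_part) are hypothetical and this is not VerticalAPI's actual implementation; it only illustrates the difference in shapes.

```python
import base64


def parse_data_url(url):
    """Split 'data:<mime>;base64,<payload>' into (mime, payload)."""
    if not url.startswith("data:"):
        raise ValueError("expected a data: URL")
    header, _, payload = url[len("data:"):].partition(",")
    mime, _, encoding = header.partition(";")
    if encoding != "base64" or not payload:
        raise ValueError("expected a base64-encoded data: URL")
    return mime, payload


def to_claude_block(part):
    """Map an OpenAI image_url part to a Claude-style image content block."""
    mime, payload = parse_data_url(part["image_url"]["url"])
    return {
        "type": "image",
        "source": {"type": "base64", "media_type": mime, "data": payload},
    }


def to_gemini_part(part):
    """Map an OpenAI image_url part to a Gemini-style inline_data part."""
    mime, payload = parse_data_url(part["image_url"]["url"])
    return {"inline_data": {"mime_type": mime, "data": payload}}


# Illustrative input: placeholder bytes stand in for a real PNG.
example_payload = base64.b64encode(b"fake image bytes").decode("ascii")
openai_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/png;base64,{example_payload}"},
}
claude_block = to_claude_block(openai_part)
gemini_part = to_gemini_part(openai_part)
```

HTTPS URLs would need a different path (fetching the image or passing the URL through where the provider supports it), which this sketch leaves out.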

How it fits together

Client uploads image → your backend (consider downscaling it to ~1024px to save input tokens) → VerticalAPI /v1/chat/completions with an image_url part → provider returns the answer. For video on Gemini, pass frames via inline_data or a hosted URL; VerticalAPI handles either.
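The downscaling step can be sketched as a small sizing helper. It assumes that "~1024px" refers to the longest side and that the aspect ratio should be preserved; fit_within is a hypothetical name, and the actual pixel resampling would be done by whatever image library your backend already uses.

```python
def fit_within(width, height, max_side=1024):
    """Return (width, height) scaled so the longest side is at most max_side.

    Images that already fit are returned unchanged (never upscaled).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    # Round to whole pixels and keep each side at least 1 px.
    return max(1, round(width * scale)), max(1, round(height * scale))


landscape = fit_within(4000, 3000)   # large camera photo
small = fit_within(800, 600)         # already under the limit
```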

Working example in Python

from openai import OpenAI
import base64

client = OpenAI(base_url="https://api.verticalapi.com/v1", api_key="vapi_...")

# 1. As HTTPS URL
response = client.chat.completions.create(
    model="claude-sonnet-4-5",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}}
        ]
    }]
)

# 2. As base64 (works on every provider)
with open("chart.png", "rb") as f:  # context manager closes the file handle
    img_b64 = base64.b64encode(f.read()).decode()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Extract the data table."},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{img_b64}"}}
        ]
    }]
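)

# The reply text is in the first choice, as with any chat completion
print(response.choices[0].message.content)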
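If you send many local files, the base64 step can be wrapped in a small helper. The sketch below assumes the MIME type can be inferred from the file extension; image_part_from_file is a hypothetical name, not part of the OpenAI SDK or VerticalAPI.

```python
import base64
import mimetypes


def image_part_from_file(path):
    """Build an OpenAI-style image_url content part from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    with open(path, "rb") as f:
        payload = base64.b64encode(f.read()).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{payload}"},
    }
```

The returned dict can be placed directly in the "content" list next to a text part, for example `[{"type": "text", "text": "Extract the data table."}, image_part_from_file("chart.png")]`.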

Typical cost at production volume

An input image is typically billed as roughly 85-1,500 tokens, depending on resolution. A vision-heavy app processing 10K images/month at GPT-4o pricing costs roughly $5-30/month for the image tokens. Gemini Flash is ~5x cheaper for the same workload.
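For a rough feel of where those numbers come from, the sketch below applies OpenAI's published tile rule for GPT-4o images (a flat 85 tokens in low detail; in high detail, 85 base tokens plus 170 per 512px tile after resizing). Whether VerticalAPI or other providers count tokens the same way is an assumption, and the $2.50 per million input tokens used in the example is a placeholder price to replace with current pricing. Both function names are hypothetical.

```python
import math


def estimate_image_tokens(width, height, detail="high"):
    """Estimate input tokens for one image under OpenAI's GPT-4o tile rule."""
    if detail == "low":
        return 85  # flat cost regardless of image size
    # Step 1: shrink to fit inside a 2048 x 2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    w, h = round(width * scale), round(height * scale)
    # Step 2: shrink so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)
    # Step 3: 170 tokens per 512 px tile, plus an 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


def monthly_image_cost(images_per_month, tokens_per_image, usd_per_million_tokens):
    """Monthly cost in USD for the image tokens alone."""
    return images_per_month * tokens_per_image * usd_per_million_tokens / 1_000_000


tokens_1024 = estimate_image_tokens(1024, 1024)
# Assumed placeholder price: 2.50 USD per million input tokens.
cost_10k = monthly_image_cost(10_000, tokens_1024, 2.50)
```

Under these assumptions a 1024x1024 image comes to 765 tokens, and 10K such images cost about $19/month, which sits inside the $5-30 range quoted above. Downscaling or using low detail moves the figure toward the bottom of that range.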

See VerticalAPI plan pricing →

Common questions

Can I send video to Gemini through VerticalAPI?

Yes. Gemini 2.5 Pro accepts video as input: pass either a Files API URL or base64 frames. VerticalAPI translates from OpenAI's vision message shape to Gemini's video parts.
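The base64-frames route can be sketched as follows. It assumes that frames are sent as several image_url parts in a single user message, in playback order, and that each frame is already encoded as JPEG bytes; frames_to_message is a hypothetical helper, and the exact frame format VerticalAPI expects is not specified here.

```python
import base64


def frames_to_message(frames, prompt, mime="image/jpeg"):
    """Build one user message: a text prompt followed by one image_url part per frame."""
    content = [{"type": "text", "text": prompt}]
    for frame in frames:
        payload = base64.b64encode(frame).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{payload}"},
        })
    return {"role": "user", "content": content}


# Placeholder bytes stand in for real JPEG-encoded frames.
message = frames_to_message([b"a", b"b"], "Describe what happens in this clip.")
```

The resulting message goes in the `messages` list of a normal chat completion call. Sampling a frame every second or so, rather than sending every frame, keeps the input token count manageable.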

What about audio in/out (voice agents)?

GPT-4o audio mode and Whisper (via Groq, OpenAI, Replicate) are routable. For full-duplex voice agents, pair Whisper transcription with Claude/GPT chat and an output TTS such as ElevenLabs (the TTS step is outside VerticalAPI's scope).
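The transcription-plus-chat half of that pipeline might look like the sketch below. It assumes an OpenAI SDK client pointed at VerticalAPI, that the audio transcription endpoint is reachable through the same base URL, and that "whisper-1" and "gpt-4o" are valid model identifiers there; all three are assumptions to check against your plan. voice_turn is a hypothetical helper, and the TTS step is left to your provider of choice.

```python
def voice_turn(client, audio_path, stt_model="whisper-1", chat_model="gpt-4o"):
    """Transcribe one audio clip, then ask the chat model to reply to the transcript.

    `client` is expected to behave like an OpenAI SDK client.
    Returns (transcript_text, reply_text).
    """
    with open(audio_path, "rb") as audio:
        transcript = client.audio.transcriptions.create(model=stt_model, file=audio)
    reply = client.chat.completions.create(
        model=chat_model,
        messages=[{"role": "user", "content": transcript.text}],
    )
    return transcript.text, reply.choices[0].message.content
```

Passing the client in as a parameter keeps the function easy to test with a stub and makes it straightforward to swap the chat model for Claude or another routed model.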

Do open-weights models do vision too?

Yes. Llama 3.2 90B Vision (via Together, Bedrock, Groq) handles images. Quality is below GPT-4o / Claude / Gemini, but it is cheaper at volume.