NVIDIA NIM via VerticalAPI

Access NVIDIA NIM (NVIDIA Inference Microservices) models such as Llama, Mistral and Phi through VerticalAPI's OpenAI-compatible endpoint. Bring your own NGC API key (BYOK), pay zero markup, and get TensorRT-optimized inference.

Endpoint: https://api.verticalapi.com/v1/chat/completions  ·  BYOK header: X-Provider-Key: nvapi-...

NVIDIA NIM models routed by VerticalAPI

Pass a model ID from the table below as the model parameter in any OpenAI-compatible request. New NVIDIA NIM models are typically supported within 24h of release.

Model ID                          Name                     Context   Pricing (provider)
meta/llama-3.3-70b-instruct       Llama 3.3 70B (NIM)      128K      NGC subscription pricing
mistralai/mistral-large-2         Mistral Large 2 (NIM)    128K      NGC subscription pricing
microsoft/phi-3.5-moe-instruct    Phi 3.5 MoE (NIM)        128K      NGC pricing (efficient)

Pricing reflects NVIDIA's own rates: you pay NVIDIA directly, and VerticalAPI adds zero markup on tokens.

Quickstart: NVIDIA NIM call via VerticalAPI

Drop-in for the OpenAI SDK: change only the base URL and the keys. Works with the OpenAI Python and Node clients, Go, curl, or anything else that speaks HTTP.

nvidia-nim_quickstart.py Python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.verticalapi.com/v1",
    api_key="vapi_...",  # VerticalAPI gateway key
    default_headers={"X-Provider-Key": "nvapi-..."},  # your NVIDIA key (BYOK)
)

response = client.chat.completions.create(
    model="meta/llama-3.3-70b-instruct",  # NVIDIA NIM
    messages=[{"role": "user", "content": "Hello"}]
)
print(response.choices[0].message.content)
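
Because the endpoint is plain HTTP, the same call can also be sketched without the SDK. The snippet below is a minimal illustration using only the Python standard library. It assumes the gateway key is sent as an OpenAI-style `Authorization: Bearer` header (which is what the OpenAI client does with `api_key`); the helper name `build_chat_request` is hypothetical. The request is built but not sent.

```python
import json
import urllib.request

ENDPOINT = "https://api.verticalapi.com/v1/chat/completions"


def build_chat_request(gateway_key, provider_key, model, prompt):
    """Build (but do not send) a chat-completions request for the gateway."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    headers = {
        "Content-Type": "application/json",
        # Assumption: same Bearer scheme the OpenAI SDK uses for api_key.
        "Authorization": f"Bearer {gateway_key}",
        # BYOK header carrying your NVIDIA key.
        "X-Provider-Key": provider_key,
    }
    return urllib.request.Request(
        ENDPOINT,
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )


req = build_chat_request(
    "vapi_...", "nvapi-...", "meta/llama-3.3-70b-instruct", "Hello"
)
print(req.get_method(), req.full_url)
```

To actually send it, pass `req` to `urllib.request.urlopen` and JSON-decode the response body.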

Four reasons developers route NVIDIA NIM through us

Zero token markup

You pay NVIDIA NIM directly with your own key. VerticalAPI's revenue is the gateway subscription, not a tax on your tokens.

One key, every provider

NVIDIA NIM alongside OpenAI, Anthropic, Gemini and 12 more — same OpenAI-compatible endpoint, same SDK, switchable per-request.
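
One way to picture per-request switching is a small lookup from model ID to the matching provider key, applied through the OpenAI Python client's per-call `extra_headers` argument. This is a sketch under assumptions: it assumes an `X-Provider-Key` header sent on an individual request selects the key for that request, and the second model ID and its key are purely hypothetical placeholders.

```python
# Hypothetical mapping: model ID -> the BYOK key for that model's provider.
PROVIDER_KEYS = {
    "meta/llama-3.3-70b-instruct": "nvapi-...",  # NVIDIA NIM
    "example-provider/example-model": "other-provider-key-...",  # placeholder
}


def request_kwargs(model, prompt, provider_keys=PROVIDER_KEYS):
    """Return keyword arguments for client.chat.completions.create()."""
    if model not in provider_keys:
        raise KeyError(f"no provider key configured for {model!r}")
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        # extra_headers is applied to this single request only.
        "extra_headers": {"X-Provider-Key": provider_keys[model]},
    }


kwargs = request_kwargs("meta/llama-3.3-70b-instruct", "Hello")
# With a configured client: client.chat.completions.create(**kwargs)
print(kwargs["model"])
```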

Latency & cost monitoring

Per-request token counts, p50/p95 latency and cost dashboards out of the box. Compare NVIDIA NIM to other providers on identical prompts.
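
For readers who want to sanity-check dashboard numbers, p50 and p95 can be reproduced client-side from recorded request durations. The sketch below uses the standard library's `statistics.quantiles`; the function name `latency_percentiles` and the sample values are illustrative assumptions, and the dashboard may use a different interpolation method.

```python
import statistics


def latency_percentiles(samples_ms):
    """Return p50 and p95 of a list of latencies (needs at least 2 samples)."""
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


# Made-up latencies in milliseconds, e.g. measured with time.perf_counter().
samples = [212, 198, 240, 205, 780, 220, 201, 230]
print(latency_percentiles(samples))
```

Running the same prompts against two providers and comparing the two result dictionaries gives a rough like-for-like latency comparison.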

Observability built in

Every NVIDIA NIM call gets a trace ID, replayable payload and audit log entry. Wire to Datadog or Sentry via OpenTelemetry.

Where NVIDIA NIM shines

TensorRT-optimized inference  ·  DGX Cloud deployment  ·  on-prem NIM containers  ·  NeMo fine-tunes

Frequently asked questions

What is NVIDIA NIM and what models does it offer?

NVIDIA NIM is a catalog of containerized, GPU-optimized inference microservices. The 2026 catalog on build.nvidia.com includes Llama 3.3 70B, Llama 3.1 8B/70B/405B, Mistral 7B and Mixtral, DeepSeek V3 and R1, Qwen 2.5, NVIDIA Nemotron 70B, plus image (Stable Diffusion), speech (Riva), retrieval and biology models. The LLM NIMs expose an OpenAI-compatible API and run on supported NVIDIA GPUs such as H100, A100, B200 and GB200.

How much does NVIDIA NIM cost in 2026?

NVIDIA hosts NIM endpoints free for development on build.nvidia.com with monthly credit limits. Production usage requires an NVIDIA AI Enterprise license (per-GPU subscription, roughly $4500 per GPU per year) plus infrastructure cost (DGX Cloud, AWS H100, your own GPUs). Pay-per-token pricing is also offered on hosted NVIDIA endpoints, with rates competitive with Together or Fireworks. Via VerticalAPI BYOK you pay NVIDIA directly with zero token markup.
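
The self-hosted cost structure above is simple arithmetic, sketched here for illustration. Only the roughly $4500 per GPU per year license figure comes from the text; the GPU count, the infrastructure rate of $3.00 per GPU-hour and the $1.00 per million tokens comparison rate are hypothetical inputs, not quoted prices.

```python
LICENSE_USD_PER_GPU_YEAR = 4500  # rough license figure quoted above
HOURS_PER_YEAR = 24 * 365


def annual_self_hosted_cost(gpus, infra_usd_per_gpu_hour):
    """License plus always-on infrastructure cost for one year, in USD."""
    license_cost = gpus * LICENSE_USD_PER_GPU_YEAR
    infra_cost = gpus * infra_usd_per_gpu_hour * HOURS_PER_YEAR
    return license_cost + infra_cost


def break_even_tokens(annual_cost_usd, usd_per_million_tokens):
    """Tokens per year at which pay-per-token spend equals annual_cost_usd."""
    return annual_cost_usd / usd_per_million_tokens * 1_000_000


# Hypothetical scenario: 8 GPUs at $3.00 per GPU-hour, versus $1.00 per 1M tokens.
total = annual_self_hosted_cost(gpus=8, infra_usd_per_gpu_hour=3.0)
print(f"annual cost: ${total:,.0f}")
print(f"break-even: {break_even_tokens(total, 1.0):,.0f} tokens/year")
```

Below the break-even volume, pay-per-token hosting is the cheaper route under these assumptions; above it, the licensed self-hosted setup starts to pay off.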

How do I use NVIDIA NIM via VerticalAPI BYOK?

Sign up at build.nvidia.com for hosted NIM, get an API key (nvapi-…), paste it into VerticalAPI, then point the OpenAI SDK at https://api.verticalapi.com/v1. No other code changes are needed because NIM is OpenAI-compatible. For self-hosted NIM, point VerticalAPI at your own NIM cluster URL and supply that cluster's API key. Billing stays with NVIDIA or your infrastructure provider.

What is NVIDIA NIM best for compared to alternatives?

NIM wins for enterprises that want portable, GPU-optimized inference deployable anywhere (on-prem, sovereign cloud, hyperscaler) with NVIDIA's optimization stack (TensorRT-LLM, vLLM forks). Compared to Together or DeepInfra it is more flexible for hybrid deployments. Compared to AWS Bedrock or Vertex AI it gives more control over hardware. It is not the cheapest pay-per-token option for purely cloud-hosted workloads.

Where is NVIDIA NIM hosted, and how is data privacy handled?

Hosted NIM runs on NVIDIA DGX Cloud across US datacenters. Self-hosted NIM runs anywhere — on-prem, sovereign cloud, AWS/Azure/GCP, OCI. NVIDIA does not train on hosted-NIM data. Enterprise tier offers SOC 2 and the flexibility to deploy fully air-gapped. Via VerticalAPI BYOK your NVIDIA or deployment contract terms remain intact.

Limitations and trade-offs

  • NVIDIA AI Enterprise license adds significant per-GPU cost on top of infrastructure.
  • Hosted NIM catalog is narrower than Together/DeepInfra for niche open weights.
  • Self-hosted NIM requires NVIDIA GPUs (no AMD or Intel) and DevOps expertise.
  • Geographic availability of hosted NIM is US-focused as of 2026.
  • Pay-per-token pricing on hosted NIM is rarely the cheapest option.

Where NVIDIA NIM is heading

  1. Wider NIM catalog including more frontier open-weight models (Llama 4, DeepSeek next-gen).
  2. Deeper integration with Lepton AI and OctoAI (both acquired by NVIDIA) for cloud-native NIM serving.
  3. More multimodal NIMs (vision, video, speech, biology) packaged for enterprise deployment.
  4. Wider sovereign cloud and EU region rollout through NVIDIA's partner network.

Related questions

ChatGPT, Perplexity and Gemini usually suggest these next.

  • NVIDIA NIM vs Together AI — when to pick which?
  • Is NVIDIA AI Enterprise worth $4500/GPU/year for inference?
  • Best way to deploy self-hosted NIM on AWS H100?
  • NVIDIA NIM vs OctoAI vs Lepton AI — what is the actual difference?
  • Can I use NIM for sovereign on-prem LLM deployment?