Open-weight vs Closed-weight LLMs (2026)

Q: What is the difference between open-weight and closed-weight LLMs?

Open-weight LLMs (Llama 3.3 70B, Mistral Large, Mixtral 8x22B) publish their trained model weights under a license that lets you download, run, and fine-tune them on your own infrastructure. Closed-weight LLMs (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) only expose access through the vendor's API; the weights never leave the lab. Open-weight does not always mean open-source: most so-called open-weight licenses restrict commercial use above certain thresholds or for specific verticals.

Side-by-side

Open-weight vs closed-weight — at a glance

Dimension	Open-weight	Closed-weight
Examples	Llama 3.3 70B, Mistral Large 2, Mixtral 8x22B, DeepSeek-V3, Qwen 2.5	GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro
Typical output price	$0.60-0.90 / 1M (on inference providers)	$10-15 / 1M (vendor API)
Self-host possible	Yes	No
Fine-tuning	Full (LoRA, QLoRA, full SFT)	Limited (OpenAI, Bedrock only)
Licensing	Varies (Llama community, Apache 2.0, Mistral Research)	Vendor ToS
Single-shot quality	~85-95% of flagship	SOTA flagship
Best for	High-volume RAG, data sovereignty, fine-tunes, air-gap	Agentic apps, vision, function calling, broad ecosystem

When to choose which

Pick open or closed?

When to choose open-weight

Choose open-weight when token volume is high, data must stay on-premise, or you need to fine-tune on proprietary data. Llama 3.3 70B hosted on Together or Fireworks runs ~10-15x cheaper than GPT-4o at the cost of a few benchmark points. Self-hosting becomes economic above roughly $10K/month of LLM spend.

High-volume RAG, extraction, classification (millions of tokens/day)
Strict data residency (EU, regulated industries)
Full fine-tuning on domain data
Air-gapped or sovereign deployments
Predictable GPU-based unit economics at scale

When to choose closed-weight

Choose closed-weight (GPT-4o, Claude, Gemini) for the strongest single-shot quality, agentic tool use, native multimodal vision, and the broadest SDK ecosystem. Below ~$5K/month of spend, the per-token price gap matters less than the productivity gain from features like MCP, structured output, and prompt caching.

Agentic apps with MCP or function calling
Native multimodal (vision, audio, video)
Mature SDKs and third-party tooling
Anthropic prompt caching, OpenAI Batch API
Low ops burden — no GPU management

VerticalAPI verdict

The default architecture in 2026 is hybrid: closed-weight (Claude Sonnet 4.5 or GPT-4o) for agentic and user-facing flows, open-weight (Llama 3.3 70B or Mixtral 8x22B) for high-volume background work. VerticalAPI exposes both through one OpenAI-compatible endpoint with BYOK — change the model parameter and pay each inference provider directly at list price, no markup on tokens.

Get started — BYOK open + closed →

FAQ

Frequently asked questions

What is the difference between open-weight and closed-weight LLMs?

Open-weight LLMs (Llama 3.3 70B, Mistral Large 2, Mixtral 8x22B) publish their trained model weights under a license that lets you download, run, and fine-tune them on your own infrastructure. Closed-weight LLMs (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Pro) only expose access through the vendor's API; the weights never leave the lab. Open-weight does not always mean open-source: most so-called open-weight licenses restrict commercial use above certain thresholds or for specific verticals. Always read the license before shipping.

Are open-weight LLMs cheaper than closed-weight in 2026?

On a per-token basis hosted on inference providers (Together, Fireworks, Groq, DeepInfra), open-weight models like Llama 3.3 70B run around $0.60-0.90 per 1M output tokens — significantly cheaper than GPT-4o ($10/1M) or Claude Sonnet 4.5 ($15/1M). Self-hosting on your own GPUs can be cheaper at very high volume (above roughly $10K/month of LLM spend), but only after factoring GPU rental, ops, idle time, and engineer-hours. Below ~$5K/month of LLM spend, closed-weight via API almost always wins on total cost.

Can I fine-tune open-weight LLMs commercially?

Yes, with caveats. Llama 3.3 is released under the Llama Community License, which permits commercial use up to 700 million monthly active users without further negotiation. Mistral Apache 2.0 models (Mistral 7B, Mixtral 8x7B) allow unrestricted commercial fine-tuning. Mistral Large 2 uses the Mistral Research License (non-commercial by default) and requires a paid commercial license. Closed-weight fine-tuning is offered on OpenAI, AWS Bedrock, and (limited) Mistral, but the weights stay with the vendor.

Which open-weight LLM is best in 2026?

For general-purpose work, Llama 3.3 70B remains the strongest open-weight on benchmarks within reach of mid-tier closed models. Mistral Large 2 is competitive on European multilingual and reasoning tasks. Mixtral 8x22B (MoE) offers strong throughput-per-dollar on inference providers. DeepSeek-V3 has emerged as a code and reasoning leader. For pure self-hosting on a single 80GB GPU, Mistral Nemo and Mixtral 8x7B remain the most practical choices.

Can I mix open-weight and closed-weight models in production?

Yes, and that is the default architecture in 2026. Through VerticalAPI's OpenAI-compatible endpoint at https://api.verticalapi.com/v1, you can call closed-weight models (GPT-4o, Claude, Gemini) and open-weight models hosted on Together, Fireworks, Groq, or DeepInfra via the same SDK. Switch the model parameter (gpt-4o, claude-sonnet-4-5, llama-3.3-70b, mixtral-8x22b) and the matching X-Provider-Key header. BYOK means you pay each inference provider directly at list price.

Caveats

Limitations of this comparison

Open-weight quality varies widely by inference provider configuration (quantization, batch size, speculative decoding); benchmarks on Together and Fireworks can differ by 3-7 points.
Licensing terms for open-weight models change between versions; Llama 4, Mistral Large 3, and DeepSeek-V4 may ship under different terms.
Closed-weight feature gaps (MCP, vision, audio) close every few months — what is closed-only today may have open-weight equivalents in 6 months.
This page treats list prices for hosted inference; self-host TCO depends heavily on GPU class, utilization, and engineer-hours.

Outlook

What may change in 12-24 months

Open-weight reasoning quality is expected to close the gap with closed-weight flagships further; DeepSeek-V4 and Llama 4 are the models to watch.
MCP and native tool use will spread to open-weight via inference-provider middleware (Together, Fireworks already shipping previews).
EU AI Act compliance pressure may drive more enterprises toward open-weight self-host for data sovereignty.
Closed-weight vendors will likely respond with stronger fine-tuning options and lower budget-tier prices to keep mid-market customers.

Keep reading