GPU Economics: What Inference Actually Costs in 2026
The question every AI team eventually asks: should we rent GPUs and run models ourselves, or just pay per token through an API?
The answer changed a lot in the last six months. GPU rental prices dropped. API prices dropped faster. New GPU generations shipped. And mixture-of-experts models made the whole calculation messier than it used to be.
Here's the actual math, with real numbers from real providers.
GPU rental prices right now
These are on-demand, publicly listed prices as of February 2026. No negotiated enterprise deals, no reserved instances.
| GPU | Provider | Config | $/hour (full config) | VRAM per GPU (GB) |
|---|---|---|---|---|
| NVIDIA B200 | CoreWeave | 8x GPU | $68.80 | 180 |
| NVIDIA GB200 NVL72 | CoreWeave | 4-GPU slice | $42.00 | 186 |
| NVIDIA HGX H200 | CoreWeave | 8x GPU | $50.44 | 141 |
| NVIDIA HGX H100 | CoreWeave | 8x GPU | $49.24 | 80 |
| NVIDIA GH200 | CoreWeave | 1x GPU | $6.50 | 96 |
| NVIDIA A100 80GB | CoreWeave | 8x GPU | $21.60 | 80 |
| NVIDIA L40S | CoreWeave | 8x GPU | $18.00 | 48 |
| NVIDIA RTX PRO 6000 | CoreWeave | 8x GPU | $20.00 | 96 |
A few things stand out. The B200 costs 40% more than the H100 per hour, but delivers roughly 2.5x the inference throughput for large models according to NVIDIA's own benchmarks. The H200 is barely more expensive than the H100 despite having 76% more VRAM. And the A100, which was the default choice 18 months ago, now rents for less than half the price of an H100 node and under a third of a B200 node.
CoreWeave dominates GPU cloud pricing. They're seeking an $8.5B loan backed by a Meta contract worth up to $14.2B, which tells you the scale of demand here.
What does it cost to serve a model yourself?
Let's do the math on running Llama 3.1 405B — a model big enough to compete with GPT-5 mini on most benchmarks, and the most common choice for self-hosted production deployments.
Hardware requirement: 405B parameters at FP8 precision need roughly 405GB of VRAM. That's a minimum of 6x H100 80GB GPUs, or more practically an 8-GPU H100 node.
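The sizing above can be sketched as a quick calculation. This is a rough lower bound that counts weights only, assuming 1 byte per parameter at FP8 and 1 GB = 10^9 bytes; KV cache and activation memory are not included, which is part of why a full 8-GPU node is the practical choice. The function name is illustrative.

```python
import math


def min_gpus_for_weights(params_billion: float,
                         bytes_per_param: float,
                         vram_per_gpu_gb: float) -> int:
    """Smallest GPU count whose combined VRAM holds the weights alone.

    Assumption: weights only; KV cache, activations, and framework
    overhead are ignored, so real deployments need extra headroom.
    """
    # 1e9 parameters * bytes per parameter / 1e9 bytes per GB
    weights_gb = params_billion * bytes_per_param
    return math.ceil(weights_gb / vram_per_gpu_gb)


# Llama 3.1 405B at FP8 (1 byte per parameter) on 80 GB H100s
print(min_gpus_for_weights(405, 1.0, 80))
```

With these inputs the weights come to 405 GB, which five 80 GB cards (400 GB) cannot hold, so the minimum is six.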
Hourly cost on CoreWeave: $49.24/hr for 8x H100.
Throughput: with vLLM and continuous batching, expect roughly 2,000-3,000 output tokens per second on an 8x H100 setup running Llama 405B at FP8. Call it 2,500 tok/s, the midpoint of that range, as a working estimate based on vLLM benchmarks.
Cost per million output tokens:
- 2,500 tokens/second = 9,000,000 tokens/hour
- $49.24 / 9M tokens = $5.47 per million output tokens
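The same arithmetic as a small helper, so the throughput assumption is easy to vary. This is a sketch, not a benchmark: it assumes the node is busy for the whole hour, and the function and constant names are illustrative.

```python
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Rental cost per one million output tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000


H100_NODE_HOURLY = 49.24  # 8x H100 on CoreWeave, from the rental table

# Low end, working estimate, and high end of the quoted throughput range
for tok_s in (2000, 2500, 3000):
    cost = cost_per_million_tokens(H100_NODE_HOURLY, tok_s)
    print(f"{tok_s} tok/s -> ${cost:.2f} per million output tokens")
```

Across the quoted range this works out to about $6.84, $5.47, and $4.56 per million output tokens, so the result moves a lot with the throughput you actually achieve.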
Compare that to API pricing for models in this class:
| Model | Provider | Input $/M | Output $/M | Access |
|---|---|---|---|---|
| Llama 3.1 405B | Together AI | $3.50 | $3.50 | Serverless |
| DeepSeek-R1-0528 | Together AI | $3.00 | $7.00 | Serverless |
| GPT-5 mini | OpenAI | $0.25 | $2.00 | API |
| GPT-5.2 | OpenAI | $1.75 | $14.00 | API |
| Gemini 3 Pro | Google Cloud | $2.00 | $12.00 | Vertex AI |
| Gemini 3 Flash | Google Cloud | $0.50 | $3.00 | Vertex AI |
| Qwen3.5-397B-A17B | Together AI | $0.60 | $3.60 | Serverless |
Self-hosting Llama 405B at $5.47/M output tokens is more expensive than calling Together AI's API for the same model at $3.50/M. That's the efficiency of shared infrastructure at scale. Together AI batches requests from thousands of customers across the same GPUs. You're paying for idle time; they're not.
When self-hosting wins
The math flips in three scenarios.
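First, when you're running at near-100% capacity. If your inference demand is constant and maxes out the hardware (say, a consumer product doing millions of requests per day), you stop paying for idle time and your effective per-token cost approaches the full-load figure. The $5.47/M estimate above already assumes a saturated node, so at 90% load the effective cost is closer to $6.08/M, and getting down to roughly $4.00/M output would take sustained throughput of about 3,400 tok/s, above the range quoted earlier. Either way it is still not cheaper than Together AI's serverless rate, but it is cheaper than OpenAI's GPT-5.2 at $14.00/M.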
Second, data isolation. Some industries (healthcare, defense, finance) can't send prompts to third-party APIs. The premium you pay for self-hosting is really a compliance cost. CoreWeave and Lambda offer single-tenant nodes for this.
Third, smaller models. A 7B or 8B model on a single L40S ($2.25/hr for one GPU) can push 10,000+ tokens/second. That works out to about $0.06/M output tokens — roughly matching the cheapest API options like Llama 3.2 3B at $0.06/M on Together AI. But if you're running a fine-tuned version of that model, the API option doesn't exist.
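The first and third scenarios come down to the same arithmetic: hourly price divided by the tokens actually served. A minimal sketch follows, assuming the 2,500 tok/s and 10,000 tok/s figures are peak sustained throughput and that utilization simply scales the tokens served; the function names are illustrative.

```python
def effective_cost_per_million(hourly_usd: float,
                               peak_tokens_per_second: float,
                               utilization: float = 1.0) -> float:
    """Cost per million output tokens when only part of peak capacity is used."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    tokens_served_per_hour = peak_tokens_per_second * utilization * 3600
    return hourly_usd / tokens_served_per_hour * 1_000_000


def throughput_for_target(hourly_usd: float, target_usd_per_million: float) -> float:
    """Sustained tokens/second needed to hit a target cost per million tokens."""
    return hourly_usd / target_usd_per_million * 1_000_000 / 3600


# Scenario one: Llama 405B on an 8x H100 node at different load levels
for utilization in (1.0, 0.9, 0.5):
    cost = effective_cost_per_million(49.24, 2500, utilization)
    print(f"{utilization:.0%} load -> ${cost:.2f}/M")

# Throughput an 8x H100 node would need to reach $4.00/M
print(f"{throughput_for_target(49.24, 4.00):.0f} tok/s for $4.00/M")

# Scenario three: a 7B-8B model on one L40S at $2.25/hr and 10,000 tok/s
print(f"${effective_cost_per_million(2.25, 10_000):.4f}/M")
```

Idle capacity is the expensive part: halving utilization doubles the per-token cost, while the small-model case lands near $0.06/M only if the GPU stays saturated.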
When APIs win
For most teams, most of the time. Here's why.
Mixture-of-experts models destroyed the self-hosting value proposition. Qwen3.5-397B has 397B total parameters but only activates 17B per token. Together AI charges $0.60/M input and $3.60/M output for it. Running it yourself requires enough VRAM to hold all 397B parameters even though each token only touches 17B of them. You're paying for dead weight.
The same applies to DeepSeek V3.1 ($0.60/$1.70 on Together AI), Llama 4 Maverick ($0.27/$0.85), and most new open models shipping with MoE architectures. API providers handle the memory overhead across a shared fleet. You'd handle it alone.
Batch pricing cuts costs in half. OpenAI's Batch API gives you 50% off both input and output tokens in exchange for 24-hour turnaround. For non-realtime workloads — data processing, content generation, analysis pipelines — that brings GPT-5 mini down to $0.125/$1.00. No GPU rental comes close for a model of that quality.
You don't need to hire anyone. Running inference infrastructure requires MLOps engineers. Kubernetes. Monitoring. Model updates. Quantization debugging. One senior ML infra engineer costs $200K+/year. At the prices above, that's roughly 4,000 hours on an 8x H100 node at CoreWeave, or about 100 billion output tokens through GPT-5 mini's API.
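Converting that salary into rental hours and API tokens, using only the list prices quoted in this article, can be sketched as follows. The $200,000 figure is taken as a flat budget with no benefits or overhead, and the names are illustrative.

```python
SALARY_USD = 200_000
H100_NODE_HOURLY = 49.24       # 8x H100 on CoreWeave
GPUS_PER_NODE = 8
GPT5_MINI_INPUT_PER_M = 0.25   # USD per million input tokens
GPT5_MINI_OUTPUT_PER_M = 2.00  # USD per million output tokens


def tokens_for_budget(budget_usd: float, price_per_million: float) -> float:
    """Number of tokens a budget buys at a given price per million tokens."""
    return budget_usd / price_per_million * 1_000_000


node_hours = SALARY_USD / H100_NODE_HOURLY
gpu_hours = node_hours * GPUS_PER_NODE
output_tokens = tokens_for_budget(SALARY_USD, GPT5_MINI_OUTPUT_PER_M)
input_tokens = tokens_for_budget(SALARY_USD, GPT5_MINI_INPUT_PER_M)

print(f"{node_hours:,.0f} node-hours ({gpu_hours:,.0f} GPU-hours)")
print(f"{output_tokens / 1e9:,.0f}B output tokens or {input_tokens / 1e9:,.0f}B input tokens")
```

Under these assumptions the budget buys a little over 4,000 node-hours, or on the order of 100 billion output tokens (800 billion if spent entirely on input).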
The Blackwell generation changes the math (slightly)
NVIDIA's B200 delivers roughly 2.5x the inference throughput of an H100 for FP8 workloads. At $68.80/hr for 8x B200 on CoreWeave versus $49.24 for 8x H100, you're paying 40% more for 2.5x the throughput. Per-token cost drops by about 44%.
That brings self-hosted Llama 405B on B200s down to roughly $3.10/M output tokens, finally competitive with Together AI's API rate. But B200 availability is still constrained. CoreWeave's GB200 NVL72 (the rack-scale option at $42/hr for a 4-GPU slice) adds even more memory bandwidth, and at 186GB of VRAM per GPU a slice has about 744GB in total, enough to hold a 405B model at FP8.
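The Blackwell comparison is a ratio of two ratios: how much more the node costs, divided by how much more it produces. A small sketch, assuming the 2.5x throughput figure carries over unchanged to Llama 405B at FP8 (the article's premise, not a measured result); constant names are illustrative.

```python
H100_NODE_HOURLY = 49.24  # 8x H100 on CoreWeave
B200_NODE_HOURLY = 68.80  # 8x B200 on CoreWeave
B200_SPEEDUP = 2.5        # assumed throughput multiple over H100
H100_COST_PER_M = 5.47    # self-hosted Llama 405B estimate from earlier

price_ratio = B200_NODE_HOURLY / H100_NODE_HOURLY  # how much more you pay
relative_cost = price_ratio / B200_SPEEDUP         # per-token cost vs H100
b200_cost_per_m = H100_COST_PER_M * relative_cost

print(f"price premium: {price_ratio - 1:.0%}")
print(f"per-token cost drop: {1 - relative_cost:.0%}")
print(f"B200 cost: ${b200_cost_per_m:.2f}/M output tokens")
```

This comes to about $3.06/M, in line with the rounded $3.10 figure. If the real speedup on your workload is lower than 2.5x, the saving shrinks proportionally.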
For teams that can get B200 allocations and run at high capacity, self-hosting starts to make financial sense again. For everyone else, the API gap keeps widening.
The real cost nobody talks about
Electricity. A single 8x H100 node draws about 10.2 kW under load. At US commercial electricity rates ($0.12/kWh average from the EIA), that's $1.22/hr just for power — roughly 2.5% of the CoreWeave rental price. Not a big deal for cloud renters.
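The power share is a one-line calculation, sketched here with the figures from the paragraph above. It covers node draw only; cooling and facility overhead are not modelled, and the names are illustrative.

```python
def power_cost_per_hour(draw_kw: float, usd_per_kwh: float) -> float:
    """Electricity cost for one hour at a constant power draw."""
    return draw_kw * usd_per_kwh


NODE_DRAW_KW = 10.2       # 8x H100 node under load
RATE_USD_PER_KWH = 0.12   # US commercial average cited above
H100_NODE_HOURLY = 49.24  # CoreWeave rental price

power = power_cost_per_hour(NODE_DRAW_KW, RATE_USD_PER_KWH)
share = power / H100_NODE_HOURLY

print(f"${power:.2f}/hr for power, {share:.1%} of the rental price")
```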
But if you're Meta building out data centers that consume gigawatts, or CoreWeave financing $8.5B in infrastructure, power becomes the constraint that sets the floor on how cheap inference can get. Big Tech is projected to invest $650B in AI infrastructure in 2026, and a meaningful chunk of that is electricity and cooling.
Bottom line
For teams processing fewer than 10B tokens per month, APIs are cheaper, simpler, and better maintained. GPT-5 mini at $0.25/$2.00 or Qwen3.5-397B at $0.60/$3.60 will outperform anything you self-host at the same cost.
For teams above 10B tokens/month with consistent demand, self-hosting on B200s starts to pencil out — but only if you have the engineering team to run it and can tolerate the 3-6 month wait for hardware allocation.
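One way to sanity-check a volume threshold like this is to compare a month of node rental against the API bill for the same tokens. The sketch below is deliberately simplified and rests on several assumptions: a 720-hour month with the node rented around the clock, output tokens only, Together AI's $3.50/M Llama 405B rate as the API baseline, B200 throughput taken as 2,500 tok/s times 2.5, and no engineering or storage costs. All names are illustrative.

```python
HOURS_PER_MONTH = 720  # assumption: 30-day month, node rented 24/7


def monthly_capacity_tokens(tokens_per_second: float) -> float:
    """Tokens one node can produce in a month at full load."""
    return tokens_per_second * 3600 * HOURS_PER_MONTH


def break_even_tokens(hourly_usd: float, api_usd_per_million: float) -> float:
    """Monthly volume at which the API bill equals the node's rental bill."""
    return hourly_usd * HOURS_PER_MONTH / api_usd_per_million * 1_000_000


API_PRICE = 3.50  # Together AI, Llama 3.1 405B, USD per million tokens

nodes = {
    "8x H100": (49.24, 2500.0),
    "8x B200": (68.80, 2500.0 * 2.5),  # assumed 2.5x H100 throughput
}

for name, (hourly, tok_s) in nodes.items():
    needed = break_even_tokens(hourly, API_PRICE)
    capacity = monthly_capacity_tokens(tok_s)
    verdict = "reachable" if needed <= capacity else "beyond one node's capacity"
    print(f"{name}: break-even {needed / 1e9:.1f}B tokens/month, "
          f"capacity {capacity / 1e9:.1f}B ({verdict})")
```

Under these assumptions a single H100 node cannot serve enough tokens to beat the API price, while a B200 node breaks even at roughly 14B tokens per month, which is only attainable at high, steady load.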
The interesting middle ground is dedicated endpoints from providers like Together AI and Fireworks, where you rent reserved GPU capacity but the provider handles the stack. You get lower per-token costs than serverless without the ops overhead. That's where most serious production deployments end up.
If you want to see how the API prices compare across all major providers, we maintain an updated table in our LLM pricing comparison. And for context on which models are worth running in the first place, see our open source vs proprietary LLM analysis.
Kael Tiwari
AI market intelligence for investors and founders