TL;DR: LLM inference prices drop 10x per year, but production token volume grows much faster, so total spend climbs exponentially. Output tokens cost 3-5x more than input tokens, agentic workflows amplify output dramatically, and the gap between pilot cost assumptions and production reality breaks most budgets. You bend the curve with model routing, output engineering, and GPU utilization discipline.
Key Takeaways:
- Per-token pricing is a rate, not a bill. Volume, output multiplier, and concurrency dominate real spend.
- Output tokens are 3-5x costlier than input across every major provider, and agentic loops amplify them further.
- Cascade routing, structured outputs, and self-hosting at scale cut blended cost without quality loss.
The 10x-a-Year Headline Is Lying to Your Finance Team

Everyone quotes the 10x-per-year decline in LLM inference cost. Nobody talks about what happens to your bill when token volume grows 1000x in the same window. A16z's "LLMflation" thesis made the rounds for good reason.
For an LLM of equivalent performance, the cost drops 10x annually, faster than PC compute or dotcom bandwidth. When GPT-3 became publicly accessible in late 2021, it was the only model that hit MMLU 42, at $60 per million tokens. Today, capable models run under $1 per million.
That headline is true. It's also a trap.
CTOs approve budgets based on per-million pricing. They compare rates, pick the cheapest vendor, and project forward. The math looks clean.
Then production hits, and the actual bill bears no resemblance to the spreadsheet. The reason is simple: per-token cost is a rate, and rates don't tell you what you spend. Volume does.
Production systems don't buy one million tokens. They buy hundreds of millions. As workflows grow agentic, with multi-step RAG, tool calls, and chain-of-thought, the volume curve steepens non-linearly.
Our LLM inference cost benchmarks confirm this pattern: volume, not the per-token rate, drives the actual bill.
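The arithmetic behind the rate-versus-bill distinction is short. A minimal sketch, assuming an illustrative starting price and volume (placeholders, not measurements); only the 10x price decline and the 1000x volume growth come from the text above:

```python
def monthly_bill(price_per_million_usd: float, tokens: float) -> float:
    """Bill = rate x volume. The rate alone says nothing about spend."""
    return price_per_million_usd * tokens / 1_000_000

# Illustrative starting point (assumed, not measured).
price_before = 10.0            # USD per million tokens
tokens_before = 100_000_000    # tokens per month

# One year later: the rate falls 10x while volume grows 1000x.
price_after = price_before / 10
tokens_after = tokens_before * 1000

bill_before = monthly_bill(price_before, tokens_before)
bill_after = monthly_bill(price_after, tokens_after)
growth = bill_after / bill_before

print(f"before=${bill_before:,.0f} after=${bill_after:,.0f} growth={growth:.0f}x")
```

Under these assumptions the rate falls by 10x and the bill still grows by 100x, which is the gap a per-million price comparison hides.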
Output Tokens Are the Silent Budget Killer
Here's the line item almost nobody budgets for. Output tokens cost 3-5x more than input tokens across every major provider. API pricing spans $0.25 to $15 per million input and $1.25 to $75 per million output.
Gemini 3.1 Flash charges $0.10 input and $0.40 output. The output-to-input ratio holds across every tier.
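Why the gap? Output generation is sequential. Autoregressive decoding runs one token at a time, and every step has to re-read the model weights and the KV cache, so it is limited by memory bandwidth rather than raw compute.
Input processing is parallelizable across the full context, so the hardware spends far more time per output token than per input token.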
Most cost optimization focuses on prompt compression and context trimming. That's the wrong target. Output verbosity grows unchecked.
Agentic loops, chain-of-thought reasoning, and multi-step RAG pipelines expand output token consumption per user request. A single customer support query that used to produce modest output now costs many times more. The model thinks out loud across multiple tool calls, inflating each interaction.
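To see how a loop inflates a single request, here is a rough sketch. It uses the $0.10 and $0.40 per-million prices quoted above. The token counts, the step count, and the assumption that every step re-sends the prompt plus all earlier outputs are illustrative, not measured:

```python
INPUT_PRICE = 0.10   # USD per million input tokens (figure quoted above)
OUTPUT_PRICE = 0.40  # USD per million output tokens (figure quoted above)

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost of one model call in USD."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000

def agentic_cost(prompt_tokens: int, output_per_step: int, steps: int) -> float:
    """Assumed loop: each step re-sends the prompt plus all earlier outputs."""
    total = 0.0
    context = prompt_tokens
    for _ in range(steps):
        total += call_cost(context, output_per_step)
        context += output_per_step  # this step's output becomes next step's input
    return total

# Hypothetical request: 1,000 prompt tokens, 300 output tokens per call.
single = call_cost(1_000, 300)
agent = agentic_cost(1_000, 300, steps=5)

print(f"single call: ${single:.5f}, five-step loop: ${agent:.5f}")
```

In this sketch the five-step loop costs more than five single calls, because every output token is billed once as output and then again as input on each later step.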
The model training and inference economics shift when output dominates the bill. If output tokens are the multiplier, what happens when you scale from prototype to production volume?
Why the 100K-to-100M Token Trajectory Breaks Most Budgets
Linear thinking is the silent killer. It says 100M tokens should cost 1000x the 100K pilot. In practice, production workloads add retrieval context, guardrails, retry logic, and orchestration. Each inflates token counts beyond any naive projection.
Production LLM cost is dominated by the interaction between model choice, output length, and concurrency, not by raw token volume alone. The same workload at 100K tokens costs a few dollars.
At 100M tokens, the naive projection says a few thousand. Real spend often lands at several times that. The multiplier comes from output expansion, infrastructure overhead, and the fact that production traffic spikes don't follow pilot patterns.
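A small sketch of that gap, assuming a $3 pilot and a set of made-up overhead multipliers. The factor names follow the overheads listed above, but every number here is a placeholder to replace with your own measurements:

```python
import math

PILOT_COST_USD = 3.0       # "a few dollars" at 100K tokens (assumed value)
SCALE = 100_000_000 // 100_000   # 100M tokens / 100K tokens = 1000x

# Hypothetical token-inflation factors per production concern.
overheads = {
    "retrieval_context": 1.60,
    "guardrails": 1.10,
    "retries": 1.15,
    "orchestration": 1.20,
}

naive = PILOT_COST_USD * SCALE
multiplier = math.prod(overheads.values())   # factors compound, they do not add
production = naive * multiplier

print(f"naive=${naive:,.0f} multiplier={multiplier:.2f}x production=${production:,.0f}")
```

The point of the sketch is that the factors multiply: four modest overheads combine into a bill well over double the naive figure.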
CTOs who modeled cost at 100K find their per-unit cost assumption was a prototype artifact. Not a production number. The pilot used a frontier model for everything.
Production needs routing, caching, fallback logic, and observability. Each adds tokens you didn't budget for. The production-scale inference cost patterns follow a predictable shape: flat for the first million, then steep.
So the curve is brutal, but it bends if you know where to apply force. Three levers actually move the number.
The Three Levers That Actually Bend the Cost Curve
Lever 1: Model routing. Gemini 3.1 Flash at $0.10 per million input handles the bulk of classification, extraction, and summarization tasks. Route only complex reasoning to frontier models. A classifier in front of your LLM call can reduce blended cost without measurable quality loss.
Lever 2: Output engineering. Structured outputs via JSON schemas and constrained decoding cut output tokens versus free-form generation. Stop sequences and `max_tokens` caps prevent runaway verbosity.
Most teams never set these. They let the model ramble until it hits the natural endpoint, which is often far longer than needed.
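As an illustration of those caps, the sketch below builds a request payload and rejects replies that ignore the structure. The payload shape is hypothetical: `max_tokens` and `stop` are common parameter names, but `required_keys` is invented for this example, so check your provider's API reference for the real fields:

```python
import json

def build_request(prompt: str, required_keys: list[str], max_tokens: int = 128) -> dict:
    """Build a hypothetical request payload with output caps applied."""
    return {
        "prompt": prompt,
        "max_tokens": max_tokens,        # hard ceiling on output tokens
        "stop": ["\n\n"],                # stop at the first blank line
        "required_keys": required_keys,  # hypothetical field: keys the JSON reply must carry
    }

def parse_reply(raw: str, required_keys: list[str]) -> dict:
    """Accept only a JSON object that contains every required key."""
    data = json.loads(raw)
    missing = [key for key in required_keys if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data

keys = ["intent", "priority"]
request = build_request("Classify this support ticket: ...", keys)
reply = parse_reply('{"intent": "refund", "priority": "high"}', keys)
print(request["max_tokens"], reply)
```

A two-field JSON object like this one is a handful of tokens, where a free-form explanation of the same classification could run to paragraphs.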
Lever 3: GPU utilization. Self-hosted breakeven depends on sustained utilization that justifies the fixed cost of ownership versus rented API capacity. GPUs are expensive to own and cheap to rent, until they aren't.
The mechanics of GPU utilization and inference efficiency decide which side of that line you land on.
Cascade architecture is the highest-impact cost reduction most teams haven't built. A cheap model handles the request first and escalates to a frontier model only when its confidence falls below a threshold.
It works because classification and extraction don't need a 100B-parameter model. They need a 7B model with good routing.
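The control flow of a cascade is small enough to sketch. This is a minimal illustration, not a production router: the two model functions are stubs, the confidence score is faked from prompt length, and the 0.8 threshold is an arbitrary assumption:

```python
from typing import Callable, Tuple

# A model takes a prompt and returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]

def cascade(prompt: str, cheap: Model, frontier: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Try the cheap model first; escalate only when it is not confident."""
    answer, confidence = cheap(prompt)
    if confidence >= threshold:
        return answer, "cheap"
    answer, _ = frontier(prompt)
    return answer, "frontier"

def cheap_stub(prompt: str) -> Tuple[str, float]:
    # Stand-in for a small model: pretend short prompts are easy.
    confidence = 0.95 if len(prompt.split()) <= 8 else 0.40
    return f"cheap:{prompt}", confidence

def frontier_stub(prompt: str) -> Tuple[str, float]:
    # Stand-in for a frontier model.
    return f"frontier:{prompt}", 0.99

easy = cascade("Is this email spam?", cheap_stub, frontier_stub)
hard = cascade(
    "Compare these three contracts clause by clause and explain every conflict",
    cheap_stub, frontier_stub,
)
print(easy[1], hard[1])
```

In a real system the confidence signal might come from token log-probabilities, a verifier model, or a schema check on the cheap model's output. Which one works is workload-specific.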
Knowing the levers is half the battle. The other half is building a cost model your CFO will believe.
Building a Cost Model That Survives Contact with Production

Step 1: Benchmark your workload. Use NVIDIA GenAI-Perf or an equivalent tool. Measure throughput, latency, and tokens-per-request at realistic concurrency.
A model that benchmarks well on a single request can collapse under production load.
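Whatever tool produces the measurements, the numbers to pull out are tail latency and tokens per request, not averages from a single call. A minimal sketch over hypothetical samples, using a nearest-rank percentile (the sample values are invented for illustration):

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Hypothetical samples taken at realistic concurrency: (latency_seconds, output_tokens)
samples = [
    (0.8, 120), (0.9, 140), (1.1, 150), (1.0, 130), (0.7, 110),
    (1.3, 200), (2.9, 610), (1.2, 180), (0.9, 125), (4.2, 935),
]

latencies = [latency for latency, _ in samples]
tokens = [count for _, count in samples]

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
mean_tokens = sum(tokens) / len(tokens)

print(f"p50={p50}s p95={p95}s mean_output_tokens={mean_tokens:.0f}")
```

In this made-up data the median looks healthy while the p95 is roughly four times worse, driven by the two long-output requests. That is the pattern a single-request benchmark misses.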
Step 2: Map latency constraints to required model instances. Plot the latency-throughput tradeoff curve. Calculate peak requests per second against your SLA target.
This tells you how many GPUs you need for self-hosting, or what concurrency tier you need for API.
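The sizing step reduces to one division once the benchmark gives you a per-instance throughput at your SLA. A rough sketch, where the peak load, the per-instance figure, and the 70% target utilization are all assumed values:

```python
import math

def instances_needed(peak_rps: float, rps_per_instance_at_sla: float,
                     target_utilization: float = 0.7) -> int:
    """Instances required to serve peak load while leaving headroom.

    rps_per_instance_at_sla: highest request rate one instance sustains
    while still meeting the latency SLA (read off the benchmark curve).
    """
    usable_rps = rps_per_instance_at_sla * target_utilization
    return math.ceil(peak_rps / usable_rps)

# Hypothetical workload: 120 requests/sec at peak, 8 requests/sec per instance.
count = instances_needed(peak_rps=120, rps_per_instance_at_sla=8)
print(count)
```

Running instances below 100% of their measured capacity is a design choice: the headroom absorbs spikes without breaching the SLA, at the cost of more hardware.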
Step 3: Model three scenarios: current volume, 10x growth, and 100x growth. Factor in output token expansion, which compounds for agentic workflows. The cost model that works at 100K tokens breaks at 100M because output behavior changes under load.
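The three scenarios can be laid out in a few lines. This sketch uses the $0.10 and $0.40 per-million prices quoted earlier. The base token volumes and the output expansion factor attached to each scenario are assumptions for illustration:

```python
INPUT_PRICE = 0.10   # USD per million input tokens (figure quoted earlier)
OUTPUT_PRICE = 0.40  # USD per million output tokens (figure quoted earlier)

# Assumed current monthly volume.
BASE_INPUT_TOKENS = 80_000_000
BASE_OUTPUT_TOKENS = 20_000_000

# Volume multiplier -> assumed output expansion as workflows grow more agentic.
scenarios = {1: 1.0, 10: 1.5, 100: 2.0}

def scenario_cost(volume_multiplier: int, output_expansion: float) -> float:
    """Monthly cost in USD for one growth scenario."""
    input_cost = BASE_INPUT_TOKENS * INPUT_PRICE / 1_000_000
    output_cost = BASE_OUTPUT_TOKENS * output_expansion * OUTPUT_PRICE / 1_000_000
    return volume_multiplier * (input_cost + output_cost)

costs = {mult: scenario_cost(mult, exp) for mult, exp in scenarios.items()}
for mult, cost in costs.items():
    print(f"{mult:>3}x volume -> ${cost:,.0f}/month")
```

With these placeholder inputs, 100x the volume produces 150x the cost. The extra 50% is entirely output expansion.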
Step 4: Add infrastructure overhead. KV cache memory, attention compute, and embedding generation for RAG inflate TCO beyond API sticker price. We covered related LLM serving cost optimization patterns and the auto-scaling pitfalls that inflate GPU spend in earlier work.
A cost model built from production-shaped assumptions survives the transition from pilot to scale. The next question is where the line crosses between API and self-hosting.
The Self-Host vs API Decision: A Breakeven Framework
API wins when volume is bursty or unpredictable. When you need frontier capabilities you can't host. When your team lacks MLOps infrastructure for transformer serving and KV cache management.
Self-host wins when sustained GPU utilization justifies the fixed cost of ownership relative to monthly API spend. When data residency requirements block API usage. When a single workload's monthly spend is large enough to cover both the GPUs and the engineering overhead of running them.
Below that threshold, API pricing is more cost-effective once you factor in infrastructure, monitoring, and fine-tuning.
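The utilization threshold can be estimated with a back-of-the-envelope formula. In this sketch the GPU hourly cost and peak throughput are hypothetical placeholders, and the API price is the $0.40 per-million output figure quoted earlier. It deliberately ignores engineering, monitoring, and fine-tuning overhead, so the real threshold sits higher:

```python
def self_hosted_cost_per_million(gpu_hourly_usd: float, peak_tokens_per_sec: float,
                                 utilization: float) -> float:
    """USD per million tokens on an owned or reserved GPU at a given utilization."""
    tokens_per_hour = peak_tokens_per_sec * 3600 * utilization
    return gpu_hourly_usd / tokens_per_hour * 1_000_000

def breakeven_utilization(gpu_hourly_usd: float, peak_tokens_per_sec: float,
                          api_price_per_million: float) -> float:
    """Utilization at which self-hosted cost per token equals the API price."""
    return gpu_hourly_usd * 1_000_000 / (peak_tokens_per_sec * 3600 * api_price_per_million)

GPU_HOURLY_USD = 2.50   # assumed all-in hourly cost of one GPU
PEAK_TOKENS_PER_SEC = 2500  # assumed sustained throughput at full load
API_PRICE = 0.40        # USD per million tokens (output price quoted earlier)

threshold = breakeven_utilization(GPU_HOURLY_USD, PEAK_TOKENS_PER_SEC, API_PRICE)
at_30_percent = self_hosted_cost_per_million(GPU_HOURLY_USD, PEAK_TOKENS_PER_SEC, 0.30)

print(f"breakeven utilization: {threshold:.0%}")
print(f"self-hosted at 30% utilization: ${at_30_percent:.2f} per million tokens")
```

Under these assumptions the GPU has to stay busy about 69% of the time to match the API price. At 30% utilization, self-hosting costs more than twice as much per token.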
Hybrid is the default for most enterprises. API for spike capacity and frontier tasks. Self-hosted fine-tuned models for high-volume, well-defined workloads.
Embedding generation and RAG retrieval are often the first workloads to bring in-house. They have high volume and low latency sensitivity. The embedding model is a neural network you can optimize on its own.
The self-hosted transformer inference economics framework has guided this decision across enterprise deployments in regulated industries. We've also seen teams bleed budget on inference when the breakeven math isn't done upfront. Getting the cost curve right is what separates a 12-month pilot from a system that compounds value for years.
What Mature LLM Cost Management Actually Looks Like
Cost predictability. Monthly spend tracks forecast closely because the model is calibrated to production token patterns, not pilot assumptions.
Margin protection. Cascade routing and output engineering reduce cost-per-interaction versus naive single-model deployments. The savings compound as volume grows.
The systems that deliver long-term value aren't the ones that picked the cheapest model. They're the ones with cost governance that adapts as token volume and model selection evolve. The infrastructure for this looks like enterprise AI cost governance that ingests usage data, flags drift, and routes automatically.
Cost governance is the moat. The model will change. The governance is what keeps margins intact.
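The drift check at the heart of that governance loop can start very simply. A minimal sketch, assuming per-workload monthly forecasts and a 15% tolerance. The workload names, dollar amounts, and threshold are all hypothetical:

```python
def flag_drift(forecast: dict, actual: dict, tolerance: float = 0.15) -> dict:
    """Return workloads whose spend deviates from forecast by more than tolerance.

    Values are relative drift: 0.3 means 30% over forecast.
    """
    flagged = {}
    for workload, planned in forecast.items():
        spent = actual.get(workload, 0.0)
        drift = (spent - planned) / planned
        if abs(drift) > tolerance:
            flagged[workload] = round(drift, 3)
    return flagged

# Hypothetical monthly figures in USD.
forecast = {"support-bot": 4000.0, "doc-extraction": 1500.0, "search-rag": 2500.0}
actual = {"support-bot": 5200.0, "doc-extraction": 1550.0, "search-rag": 2400.0}

flagged = flag_drift(forecast, actual)
print(flagged)
```

Here only the support bot is flagged, at 30% over forecast. A flag like that is the trigger to look at output length and routing share for that workload before the overrun compounds.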
Frequently Asked Questions
Q: Why is LLM inference cost so high if prices drop 10x every year?
Per-token prices decline 10x annually, but production token volume grows much faster. Agentic workflows, RAG pipelines, and multi-turn interactions expand output token counts. The 10x reduction applies to the rate, not the total bill. Your spend grows because consumption outpaces the price decline.
Q: What's the difference between input and output token pricing?
Output tokens cost 3-5x more than input tokens across all major providers. Gemini 3.1 Flash charges $0.10 per million input and $0.40 per million output. The multiplier exists because output generation is sequential, one token at a time, while input processing is parallelizable.
Q: When should we self-host LLMs instead of using APIs?
Self-host when sustained GPU utilization justifies the fixed cost of ownership versus API spend, when data residency requirements block external API usage, or when monthly spend on a single workload is large enough to also cover the engineering overhead of running your own GPUs. Below those thresholds, API pricing is more cost-effective.
Q: How do we reduce LLM serving cost without losing quality?
Three high-leverage moves: cascade routing (cheap model first, escalate on confidence threshold), output engineering (structured outputs, stop sequences, max_tokens caps), and GPU utilization discipline for self-hosted workloads. Most teams see real cost reduction without measurable quality impact.
Q: What is a realistic production LLM cost per million tokens?
A cascade architecture, in which a cheap model handles the majority of requests and escalates only complex cases to a frontier model, lands well below the cost of using a frontier model for everything. The blended cost reflects the weighted average of both tiers, with the cheap model dominating volume.
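That weighted average is easy to compute for your own traffic mix. A sketch using input prices quoted earlier in the article ($0.10 per million for the cheap tier, $15 per million as the top of the quoted range for the frontier tier). The 80/20 traffic split is an assumption, not a benchmark:

```python
def blended_price(cheap_price: float, frontier_price: float, cheap_share: float) -> float:
    """Weighted-average price per million tokens across both tiers."""
    return cheap_share * cheap_price + (1 - cheap_share) * frontier_price

CHEAP_PRICE = 0.10      # USD per million input tokens (quoted earlier)
FRONTIER_PRICE = 15.00  # USD per million input tokens (top of the quoted range)
CHEAP_SHARE = 0.80      # assumed share of tokens the cheap tier handles

blended = blended_price(CHEAP_PRICE, FRONTIER_PRICE, CHEAP_SHARE)
savings = 1 - blended / FRONTIER_PRICE

print(f"blended=${blended:.2f} per million, {savings:.0%} below frontier-only")
```

Note that the frontier tier still accounts for almost all of the blended price in this example, so every point of traffic moved to the cheap tier matters more than shaving the cheap tier's own rate.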
About the author
Mayank Singh is a software developer at Levitation Infotech, where he builds web and AI-powered applications across the company’s fintech, healthcare, and enterprise projects.
