How LLM Pricing Works (API Unit Economics & Cost Optimization)

Estimate Your Monthly Bills

Ready to run the math for your application? Input your token profiles and request volume into our LLM Cost Calculator to get instant projections.

1. The Metrics of LLM Billing: Per-Token Pricing

Traditional software hosting is billed by CPU hours or server capacity. Large Language Models (LLMs) break this paradigm and bill by a utility metric: tokens processed. Per-request compute grows roughly linearly with sequence length: self-attention scales quadratically, but at typical context lengths inference is dominated by weight matrix multiplications, whose cost is linear in the number of tokens. Charging per token therefore ties the price a developer pays closely to the cost the host incurs.

Providers express their rates in Cost per Million Tokens (MTok). If a provider charges $2.50 / MTok input, processing a prompt containing 10,000 tokens costs exactly $0.025. While tiny in isolation, this cost scales rapidly once systems handle thousands of queries per minute.
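The arithmetic above can be sketched in a few lines. This is a minimal illustration, not a billing implementation: the function name `token_cost` and the load of 1,000 prompts per minute are assumptions chosen for the example; only the $2.50 / MTok rate and the 10,000-token prompt come from the text.

```python
def token_cost(tokens: int, rate_per_mtok: float) -> float:
    """Return the dollar cost of processing `tokens` at a per-million-token rate."""
    return tokens / 1_000_000 * rate_per_mtok

# The example from the text: 10,000 input tokens at $2.50 / MTok.
single_prompt = token_cost(10_000, 2.50)

# Hypothetical load (an assumption): 1,000 such prompts per minute for 30 days.
prompts_per_month = 1_000 * 60 * 24 * 30
monthly_input_bill = single_prompt * prompts_per_month

print(f"One prompt: ${single_prompt:.3f}")
print(f"Monthly input bill: ${monthly_input_bill:,.0f}")
```

Under these assumptions the 2.5-cent prompt becomes an input bill of roughly $1.08 million per month, before any output tokens are counted.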

2. The Asymmetry: Input vs. Output Costs

On almost every LLM API pricing page, input tokens are 3x to 5x cheaper than output tokens. This asymmetry is driven by how GPUs execute the two phases of inference:

Input processing (Prefill): When you send a prompt, the GPU processes all of its tokens in parallel. Each weight matrix is read from memory once and reused across every token in the prompt, so the arithmetic units stay busy and the phase is limited by compute rather than by memory bandwidth.
Output generation (Decoding): LLM generation is autoregressive. To predict token N+1, the model must attend to all previous tokens (1 to N) and read the full set of model weights from GPU memory again. Tokens are produced one at a time, so this weight read repeats for every output token, and the decode phase is bottlenecked by GPU memory bandwidth.
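A rough way to see the decode bottleneck is to divide memory bandwidth by the size of the weights, since each generated token needs one full pass over them. The sketch below is a back-of-envelope bound only: the function name is hypothetical, the bandwidth figure of about 3,350 GB/s is an assumed value for a high-end datacenter GPU, and the estimate ignores KV cache reads, kernel overhead, and batching.

```python
def decode_tokens_per_sec_bound(bandwidth_gb_s: float,
                                params_billion: float,
                                bytes_per_param: float) -> float:
    """Upper bound on single-stream decode speed, assuming every token
    requires streaming all model weights from GPU memory exactly once."""
    weight_gb = params_billion * bytes_per_param  # billions of params x bytes each = GB
    return bandwidth_gb_s / weight_gb

# Assumed figures: ~3,350 GB/s bandwidth, 70B parameters at 4 bits (0.5 byte) each.
bound = decode_tokens_per_sec_bound(3350, 70, 0.5)
print(f"Single-stream ceiling: about {bound:.0f} tokens/sec")
```

Real single-stream throughput lands well below this ceiling, which is why hosts batch many requests together so that one pass over the weights serves several users at once.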

Because outputs consume disproportionate memory bandwidth, developers should structure instructions to enforce concise, structured responses, minimizing completion output size.
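To make the effect of output length concrete, the following sketch prices a single request at separate input and output rates. The rates ($3 and $15 per MTok, a 5x ratio) and the token counts are illustrative assumptions, not any provider's actual price list.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate: float, output_rate: float) -> tuple[float, float]:
    """Return (input_cost, output_cost) in dollars; rates are per million tokens."""
    input_cost = input_tokens / 1_000_000 * input_rate
    output_cost = output_tokens / 1_000_000 * output_rate
    return input_cost, output_cost

# Assumed rates and sizes: $3 in / $15 out per MTok, 2,000-token prompt.
verbose_in, verbose_out = request_cost(2_000, 500, 3.0, 15.0)
concise_in, concise_out = request_cost(2_000, 250, 3.0, 15.0)

verbose_total = verbose_in + verbose_out
concise_total = concise_in + concise_out
print(f"500-token answer: ${verbose_total:.5f} ({verbose_out / verbose_total:.0%} from output)")
print(f"250-token answer: ${concise_total:.5f}")
```

With these assumed numbers, the 500 output tokens cost more than the 2,000 input tokens, so halving the response length cuts the request cost by more than a quarter.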

3. Advanced Pricing Modifiers: Prompt Caching

As context windows scale to 1 million+ tokens, sending static system prompts or document context on every request becomes financially prohibitive. To mitigate this, providers like Google, Anthropic, and DeepSeek offer Prompt Caching:

When a client sends a request, the host stores the computed Key-Value cache (KV Cache) of the prompt on the server. If a subsequent request contains the exact same prefix, the model resumes from the cached state, bypassing prefill computation.
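The exact-prefix rule can be illustrated with a toy lookup. This is a conceptual sketch only: real providers cache KV tensors rather than token tuples and apply their own granularity and expiry rules, and the function name and word-level "tokens" here are hypothetical stand-ins for integer token IDs.

```python
def longest_cached_prefix(request_tokens: list[str],
                          cached_prefixes: set[tuple[str, ...]]) -> int:
    """Return how many leading tokens of the request match a cached prefix exactly."""
    best = 0
    for prefix in cached_prefixes:
        n = len(prefix)
        # A prefix counts only if the request starts with it token for token.
        if n > best and tuple(request_tokens[:n]) == prefix:
            best = n
    return best

# Toy cache holding one static system prompt (words stand in for tokens).
cache = {("You", "are", "a", "helpful", "assistant.")}

hit = longest_cached_prefix(
    ["You", "are", "a", "helpful", "assistant.", "Summarize", "this."], cache)
miss = longest_cached_prefix(
    ["Hello!", "You", "are", "a", "helpful", "assistant."], cache)
print(hit, miss)
```

The second request contains the same words but not at the start, so nothing is reused. This is why static content should be placed at the beginning of the prompt and dynamic content at the end.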

Cache hits are rewarded with substantial discounts:

Anthropic Claude: Cache-write costs a 25% premium, but cache-read is discounted by 90%.
Google Gemini: Offers a 50% discount on inputs that hit the cache (active for contexts over 32k tokens).
DeepSeek V3: Cache-hit inputs are priced at $0.014 / MTok — a 90% discount from the baseline $0.14 rate.
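The savings from a write premium and read discount can be estimated as follows. This sketch uses the 25% write premium and 90% read discount quoted above as multipliers; the $3 / MTok base rate, the 50,000-token prefix, and the assumption that every request after the first hits the cache (no expiry) are illustrative. Only the shared prefix is priced; the dynamic part of each request is billed normally and is left out.

```python
def prefix_cost_uncached(prefix_tokens: int, requests: int, base_rate: float) -> float:
    """Cost of resending the same prefix at the full input rate on every request."""
    return prefix_tokens / 1_000_000 * base_rate * requests

def prefix_cost_cached(prefix_tokens: int, requests: int, base_rate: float,
                       write_multiplier: float = 1.25,
                       read_multiplier: float = 0.10) -> float:
    """Cost when the first request writes the cache and all later ones read it.
    Assumes every later request is a cache hit (no expiry, exact prefix match)."""
    per_request = prefix_tokens / 1_000_000 * base_rate
    return per_request * write_multiplier + per_request * read_multiplier * (requests - 1)

# Assumed workload: 50,000-token static prefix, 100 requests, $3 / MTok base rate.
uncached = prefix_cost_uncached(50_000, 100, 3.0)
cached = prefix_cost_cached(50_000, 100, 3.0)
print(f"Without caching: ${uncached:.2f}")
print(f"With caching:    ${cached:.4f}")
```

Under these assumptions the write premium is recovered on the second request, since one write plus one read costs 1.35x the base price against 2x for two uncached sends.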

4. The Unit Economics of Custom Models: Fine-Tuning

Fine-tuning allows developers to adapt model behavior, formatting, and tone to specific domains. However, deploying a fine-tuned model alters hosting economics. While base models are hosted in shared multi-tenant memory pools (allowing hosts to share GPU costs across thousands of developers), a fine-tuned model contains custom weight matrices that must be loaded onto dedicated hardware.

Consequently, fine-tuned APIs are billed differently:

Training Costs: A one-time charge per million tokens processed during the gradient descent training phase.
Inference Costs: Premium per-token rates, often 2x to 3x higher than standard model APIs.
Hosting Fees: Some hosts charge a flat hourly rate (e.g. $1.00 to $4.00/hour) for keeping the weights active in GPU memory, regardless of query volume.
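The three components above can be combined into a simple monthly estimate. All figures in this sketch are hypothetical (a $25 / MTok training rate, a $6 / MTok inference rate, and a $2.00 hourly hosting fee within the range quoted above); substitute your provider's actual rates.

```python
HOURS_PER_MONTH = 24 * 30  # 30-day month

def fine_tune_costs(training_mtok: float, training_rate: float,
                    monthly_mtok: float, inference_rate: float,
                    hosting_rate_per_hour: float = 0.0) -> dict[str, float]:
    """Split fine-tuned model spend into one-time and recurring dollar amounts."""
    inference = monthly_mtok * inference_rate
    hosting = hosting_rate_per_hour * HOURS_PER_MONTH  # charged regardless of traffic
    return {
        "one_time_training": training_mtok * training_rate,
        "monthly_inference": inference,
        "monthly_hosting": hosting,
        "monthly_total": inference + hosting,
    }

# Hypothetical inputs: 10 MTok of training data, 50 MTok of traffic per month.
costs = fine_tune_costs(10, 25.0, 50, 6.0, hosting_rate_per_hour=2.0)
for name, dollars in costs.items():
    print(f"{name}: ${dollars:,.2f}")
```

In this example the flat hosting fee ($1,440) dwarfs the per-token inference charge ($300), which shows how a fixed hourly fee dominates at low query volume.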

5. Self-Hosting Math: The GPU Amortization Floor

For high-volume enterprise operations, variable API costs can eventually exceed the fixed cost of purchasing or leasing dedicated hardware. To evaluate this threshold:

Imagine leasing an Nvidia H100 GPU (80GB VRAM) for $2.50 / hour ($1,800 / month). If your application runs a 70B parameter model at Q4 quantization, the GPU can generate approximately 50 tokens/sec.

If active 24/7, the H100 generates:
50 tok/sec × 3600 sec × 24 hrs × 30 days = 129.6 million tokens / month.

Routing the same volume through the Claude 3.5 Sonnet API (blended rate $5.40 / MTok) would cost:
129.6 MTok × $5.40 ≈ $700 / month.

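In this scenario, API hosting is still cheaper than leasing dedicated hardware, even with the GPU generating tokens around the clock: at 50 tokens/sec the lease works out to about $13.89 / MTok ($1,800 ÷ 129.6 MTok), versus $5.40 through the API. The lease breaks even at roughly 333 MTok / month ($1,800 ÷ $5.40), or about 129 tokens/sec sustained. Self-hosting yields margin gains only when request density is high enough, typically by batching concurrent requests, to push aggregate throughput past that point and keep the GPU fully utilized.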
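The comparison can be reproduced as a short calculation. The sketch below uses the figures from this section ($2.50 / hour lease, 50 tokens/sec, $5.40 / MTok blended API rate, 30-day month); the function names are hypothetical, and the model ignores electricity, engineering time, and idle capacity.

```python
SECONDS_PER_MONTH = 3600 * 24 * 30  # 30-day month, as in the text

def monthly_mtok(tokens_per_sec: float, utilization: float = 1.0) -> float:
    """Millions of tokens generated per month at a given sustained throughput."""
    return tokens_per_sec * utilization * SECONDS_PER_MONTH / 1_000_000

def break_even_tokens_per_sec(gpu_cost_per_hour: float, api_rate_per_mtok: float) -> float:
    """Sustained throughput at which the GPU lease costs the same as the API."""
    gpu_monthly = gpu_cost_per_hour * 24 * 30
    break_even_mtok = gpu_monthly / api_rate_per_mtok
    return break_even_mtok * 1_000_000 / SECONDS_PER_MONTH

gpu_monthly = 2.50 * 24 * 30            # lease cost per month
api_bill = monthly_mtok(50) * 5.40      # same token volume through the API
threshold = break_even_tokens_per_sec(2.50, 5.40)

print(f"GPU lease: ${gpu_monthly:,.0f} / month")
print(f"API bill:  ${api_bill:,.2f} / month")
print(f"Break-even throughput: {threshold:.1f} tokens/sec")
```

Any aggregate throughput above the break-even figure, sustained across the month, tips the comparison in favor of the leased GPU; anything below it favors the API.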

Model Your Application Strategy

Calculate cost projections across GPT, Claude, Gemini, Llama, and Mistral models dynamically.

Frequently Asked Questions

Why are output tokens more expensive than input tokens?

LLM generation is autoregressive. Each new token requires reading all model parameters from GPU memory, and tokens are produced one at a time, so decoding is limited by memory bandwidth. Input prompts, by contrast, are processed in parallel, letting the GPU reuse each weight read across many tokens and batch the arithmetic efficiently.

What is prompt caching and how does it save money?

Prompt caching stores the computed context state of static prompt prefixes (like system prompts or large database schemas) on the provider's server. When subsequent requests hit this cache, the cached tokens are charged at a discount (often 50% to 90% off the standard input rate).

How does pricing for fine-tuned models differ?

Fine-tuned models require their custom weights to be loaded into dedicated GPU memory. Providers charge a higher rate per million tokens (often 2x to 3x the standard price) and sometimes add a fixed hourly hosting fee for keeping the custom model active.

How do self-hosted LLM costs compare to API costs?

Self-hosting open-weight models removes variable token rates, substituting them with fixed hardware costs (GPU purchases or cloud server leases). This is financially viable only when query volumes are high enough to amortize those fixed costs.