AI Resources

AI API Cost Benchmarks: Price vs. Intelligence Audit

Which model provides the highest intelligence per dollar spent? This benchmark report audits major LLM cost profiles against standardized coding and reasoning evaluations (MMLU, HumanEval) to identify high-leverage APIs.

Interactive LLM Pricing Specs Explorer

Model Name	Provider	Context Size	Input / MTok	Output / MTok	Cached / MTok
DeepSeek V3	DeepSeek	128k	$0.140	$0.28	$0.014
GPT-4o-mini	OpenAI	128k	$0.150	$0.60	$0.075
Gemini 1.5 Flash	Google	2M	$0.075	$0.30	$0.037
GPT-4o	OpenAI	128k	$2.500	$10.00	$1.250
Claude 3.5 Sonnet	Anthropic	200k	$3.000	$15.00	$0.300
Gemini 1.5 Pro	Google	2M	$1.250	$5.00	$0.625
Claude 3.5 Haiku	Anthropic	200k	$0.800	$4.00	$0.080
Mistral Large 2	Mistral	128k	$2.000	$6.00	$2.000
Llama 3.3 70B (Serverless)	Meta	128k	$0.350	$0.40	$0.350

Estimate Your Billing on: GPT-4o

Input Tokens / Query

Output Tokens / Query

Queries / Day

Cost Per Query

$0.0075

Daily Running Bill

$0.75

Monthly API Expense

$22.50

1. Price-to-Performance Ratio Explained

We define the Price-to-Performance Ratio as a model's benchmark score (MMLU) divided by its blended cost per million tokens. This highlights models that punch above their weight class financially.

2. Flagship Models vs. Budget Models

Flagship models (Claude 3.5 Sonnet, GPT-4o) score high (88%+ MMLU) but cost $3.00 to $5.00 per blended million tokens. Budget models (GPT-4o-mini, Gemini 1.5 Flash) score ~82% MMLU but cost under $0.30 per million tokens. For 80% of routine workflows, budget models offer 10x superior cost-efficiency.

Frequently Asked Questions

What is the most cost-effective model for coding?

Claude 3.5 Sonnet is the gold standard for complex coding, but for simple script edits, GPT-4o-mini offers excellent accuracy at 5% of the cost.

How does DeepSeek V3 fit in price-to-performance?

DeepSeek V3 achieves scores comparable to GPT-4o while costing only 10% of OpenAI's rate, currently representing the highest price-to-performance ratio in the industry.