Which model has the highest context limit?

Among the models compared here, Google's Gemini 1.5 Pro has the highest limit, with a context window of 2 million tokens.

What is the cheapest long-context API?

DeepSeek V3 offers a 128k context window at $0.14 per million input tokens, so filling the entire window once costs roughly $0.018 in input fees.

Long-Context Models Comparison: Gemini, Claude, and Llama | ToolStrategyHub

1. Model Comparison Grid

Google Gemini 1.5 Pro leads with a 2M-token context window. Anthropic Claude 3.5 Sonnet supports 200k tokens, and Meta Llama 3.3 70B supports 128k tokens. On base API pricing, Llama is the cheapest of the three (it is an open-weight model, so the exact price depends on the hosting provider), while Gemini and Claude charge premium rates for long contexts.
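As a rough illustration of how these limits play out, the sketch below checks which of the three models could hold a prompt of a given size. The limits are the ones quoted above; the function name and the `reserved_output_tokens` default are hypothetical, and the sketch assumes output tokens count against the same window, which varies by provider.

```python
# Context limits as stated in this comparison, in tokens.
CONTEXT_LIMITS = {
    "Gemini 1.5 Pro": 2_000_000,
    "Claude 3.5 Sonnet": 200_000,
    "Llama 3.3 70B": 128_000,
}


def models_that_fit(prompt_tokens: int, reserved_output_tokens: int = 4_096) -> list[str]:
    """Return the models whose window holds the prompt plus reserved output.

    Assumption: the reply shares the context window with the prompt.
    """
    needed = prompt_tokens + reserved_output_tokens
    return [name for name, limit in CONTEXT_LIMITS.items() if needed <= limit]


for size in (100_000, 150_000, 1_500_000):
    print(f"{size:>9,} prompt tokens ->", models_that_fit(size))
```

Token counts differ between tokenizers, so the same document may fit one model's window and narrowly miss another's even when the nominal limits look similar.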

2. Recall Accuracy (lost in the middle)

"Lost in the middle" refers to a model missing information placed in the middle of a long prompt more often than information near the start or end. Claude 3.5 Sonnet maintains 99.8% recall accuracy across its 200k window. Gemini 1.5 Pro holds high recall up to 1M tokens, with minor degradation at 2M. Llama 3.3 70B holds high recall across its 128k window.
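Recall figures like these usually come from "needle in a haystack" style tests: a known fact is buried at different depths in filler text and the model is asked to retrieve it. The following is a minimal sketch of that setup, not the method behind the numbers above. All function names are hypothetical, word counts stand in for token counts, and the model call itself is left out.

```python
def build_needle_prompt(filler: str, needle: str, depth: float, total_words: int) -> str:
    """Place `needle` at a relative depth (0.0 = start, 1.0 = end) in filler text."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    filler_words = filler.split()
    if not filler_words:
        raise ValueError("filler must contain at least one word")
    # Repeat the filler until there are enough words, then trim to length.
    repeats = total_words // len(filler_words) + 1
    words = (filler_words * repeats)[:total_words]
    cut = int(depth * total_words)
    return " ".join(words[:cut] + [needle] + words[cut:])


def recall_rate(answers: list[str], expected: str) -> float:
    """Fraction of model answers that contain the expected string."""
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return hits / len(answers)


needle = "The vault code is 7319."
filler = "The quick brown fox jumps over the lazy dog."
# One prompt per insertion depth; each would be sent to the model under test.
prompts = [build_needle_prompt(filler, needle, d, 2_000) for d in (0.0, 0.25, 0.5, 0.75, 1.0)]
print(len(prompts), "prompts built")
```

Comparing `recall_rate` across depths is what exposes a "lost in the middle" pattern: a dip for the 0.25 to 0.75 depths relative to the two ends.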

3. Hardware Requirements for Self-Hosting Llama 128k

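Hosting Llama 3.3 70B with a 128k context requires substantial GPU memory. The ~40GB figure for model parameters is consistent with weights quantized to about 4 bits; at 16-bit precision the same 70 billion parameters take about 140GB. The KV cache adds roughly another 20GB for a single full-length sequence if it is stored at 8-bit precision, or about 40GB at 16-bit. Either way, self-hosting requires dedicated GPU server nodes.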
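The memory figures can be reproduced with a back-of-the-envelope estimate. The sketch below assumes the published Llama 3 70B architecture (80 layers, 8 key/value heads under grouped-query attention, head dimension 128), a round 70 billion parameters, and about 4.5 bits per weight for a 4-bit quantization with its scaling metadata. The helper names are hypothetical, and the estimate ignores activations and framework overhead.

```python
GB = 1e9  # decimal gigabytes, matching the figures in the text


def weights_bytes(n_params: float, bits_per_weight: float) -> float:
    """Memory for the model weights at a given precision."""
    return n_params * bits_per_weight / 8


def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int,
                   batch_size: int = 1) -> int:
    """Memory for the KV cache: one key and one value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens * batch_size


# Assumed Llama 3 70B shape; 128k is taken as 128 * 1024 tokens.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128
CONTEXT = 128 * 1024

w_4bit = weights_bytes(70e9, 4.5)   # assumed ~4.5 bits/weight for 4-bit quantization
w_fp16 = weights_bytes(70e9, 16)
kv_int8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_value=1)
kv_fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, CONTEXT, bytes_per_value=2)

print(f"weights, ~4-bit: {w_4bit / GB:.1f} GB")
print(f"weights, FP16:   {w_fp16 / GB:.1f} GB")
print(f"KV cache, 8-bit: {kv_int8 / GB:.1f} GB")
print(f"KV cache, FP16:  {kv_fp16 / GB:.1f} GB")
```

The KV cache scales linearly with both context length and the number of concurrent sequences, so serving several 128k requests at once multiplies the cache figure by `batch_size`.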
