1. Model Comparison Grid
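Google Gemini 1.5 Pro leads with a 2M-token context window, followed by Anthropic Claude 3.5 Sonnet at 200k tokens and Meta Llama 3.3 70B at 128k tokens. On base API pricing, Llama is typically the cheapest per token when accessed through hosted providers. Gemini and Claude are priced higher per token, so filling their long context windows is expensive; Gemini 1.5 Pro additionally bills prompts above 128k tokens at a higher per-token rate.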
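As a rough illustration of how the context limits above constrain model choice, the following sketch filters models by whether a prompt fits. The window sizes are the nominal figures quoted in this section; the function name `models_that_fit` and the idea of reserving room for output tokens inside the same window are assumptions made for illustration, and real limits should be checked against each provider's documentation.

```python
# Nominal context windows (in tokens), taken from the comparison in this section.
CONTEXT_WINDOWS = {
    "Gemini 1.5 Pro": 2_000_000,
    "Claude 3.5 Sonnet": 200_000,
    "Llama 3.3 70B": 128_000,
}


def models_that_fit(prompt_tokens: int, reserved_output_tokens: int = 0) -> list[str]:
    """Return the models whose window holds the prompt plus reserved output.

    Assumption: output tokens share the same window as the prompt.
    """
    needed = prompt_tokens + reserved_output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if needed <= window]


# A 150k-token prompt with 4k tokens reserved for the answer.
print(models_that_fit(150_000, reserved_output_tokens=4_000))
# ['Gemini 1.5 Pro', 'Claude 3.5 Sonnet']
```

Token counts differ between tokenizers, so the same document may fit one model's window and not another's even when the nominal limits look close.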
2. Recall Accuracy (lost in the middle)
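"Lost in the middle" refers to the tendency of long-context models to retrieve facts placed in the middle of a prompt less reliably than facts near the beginning or end. Claude 3.5 Sonnet maintains high recall accuracy (99.8%) across its 200k window. Gemini 1.5 Pro maintains high recall up to 1M tokens, with minor degradation at 2M. Llama 3.3 70B maintains high recall across its 128k window.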
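Recall figures like these are commonly measured with a "needle in a haystack" test: a known fact is inserted at varying depths of a long filler text and the model is asked to retrieve it. The sketch below is a minimal version of that idea, not the procedure behind the numbers above. All names (`build_haystack`, `recall_by_depth`, `forgetful_stub`) are hypothetical, lengths are measured in characters rather than tokens for simplicity, and the stub stands in for a real model call.

```python
from typing import Callable, Dict, Sequence


def build_haystack(filler: str, needle: str, total_chars: int, depth: float) -> str:
    """Repeat filler to total_chars and insert needle at relative depth (0.0 to 1.0)."""
    if not filler:
        raise ValueError("filler must be non-empty")
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    repeats = total_chars // len(filler) + 1
    body = (filler * repeats)[:total_chars]
    cut = int(len(body) * depth)
    return body[:cut] + " " + needle + " " + body[cut:]


def recall_by_depth(
    ask_model: Callable[[str], str],
    needle: str,
    answer: str,
    question: str,
    filler: str,
    total_chars: int,
    depths: Sequence[float],
) -> Dict[float, bool]:
    """For each depth, record whether the model's reply contains the answer."""
    results = {}
    for depth in depths:
        context = build_haystack(filler, needle, total_chars, depth)
        reply = ask_model(context + "\n\n" + question)
        results[depth] = answer.lower() in reply.lower()
    return results


def forgetful_stub(prompt: str) -> str:
    """Stand-in for a real model: it only 'sees' the first and last 200 characters."""
    visible = prompt[:200] + prompt[-200:]
    return "7421" if "7421" in visible else "I don't know"


results = recall_by_depth(
    ask_model=forgetful_stub,
    needle="The vault code is 7421.",
    answer="7421",
    question="What is the vault code?",
    filler="The sky was grey that morning. ",
    total_chars=2000,
    depths=[0.0, 0.5, 1.0],
)
print(results)  # {0.0: True, 0.5: False, 1.0: True}
```

The stub deliberately fails at depth 0.5 to show what a lost-in-the-middle pattern looks like in the results. To evaluate a real model, replace `forgetful_stub` with a function that sends the prompt to that model and returns its text reply.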
3. Hardware Requirements for Self-Hosting Llama 128k
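Hosting Llama 3.3 70B with a 128k context requires substantial GPU memory. The model parameters require ~40GB of VRAM when quantized to roughly 4 bits per weight; at 16-bit precision they take about 140GB. The KV cache for a single full 128k-token sequence adds about 40 GiB at 16-bit precision, or about 20 GiB with an 8-bit KV cache, assuming the Llama 3 70B architecture of 80 layers and 8 KV heads of dimension 128. The cache grows linearly with both sequence length and the number of concurrent sequences. Self-hosting therefore requires dedicated GPU server nodes.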
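The memory figures can be checked with a back-of-the-envelope calculation. The sketch below assumes the published Llama 3 70B-class architecture (80 layers, grouped-query attention with 8 KV heads, head dimension 128) and a round 70 billion parameters. The function names are hypothetical, and the result ignores activations and runtime overhead, so treat it as a lower bound rather than a sizing guide.

```python
GIB = 1024 ** 3


def weight_bytes(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights at a given precision."""
    return n_params * bits_per_param / 8


def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    seq_len: int,
    bytes_per_value: int,
    batch_size: int = 1,
) -> int:
    """Memory for the KV cache; the leading 2 accounts for keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value * batch_size


# Assumed architecture for a Llama 3 70B-class model.
LLAMA_70B = dict(n_layers=80, n_kv_heads=8, head_dim=128)
SEQ_LEN = 128 * 1024  # 128k tokens

kv_16bit = kv_cache_bytes(seq_len=SEQ_LEN, bytes_per_value=2, **LLAMA_70B)
kv_8bit = kv_cache_bytes(seq_len=SEQ_LEN, bytes_per_value=1, **LLAMA_70B)

print(f"KV cache, 16-bit: {kv_16bit / GIB:.1f} GiB")                 # 40.0 GiB
print(f"KV cache, 8-bit:  {kv_8bit / GIB:.1f} GiB")                  # 20.0 GiB
print(f"Weights, 4-bit:   {weight_bytes(70e9, 4) / 1e9:.0f} GB")     # 35 GB
print(f"Weights, 16-bit:  {weight_bytes(70e9, 16) / 1e9:.0f} GB")    # 140 GB
```

Under these assumptions, a 20GB KV cache corresponds to an 8-bit cache for one 128k-token sequence, and a ~40GB weight footprint corresponds to roughly 4-bit quantization. Serving several long sequences at once multiplies the cache size by `batch_size`.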