LLM Context Window Sizes & Specifications Database

This reference directory lists context window specifications, output token limits, and prompt caching capabilities for major large language models.

LLM Context Window Limit Visualizer

Gemini 1.5 Pro: 2,000,000 tokens

Claude 3.5 Sonnet: 200,000 tokens

GPT-4o: 128,000 tokens

Llama 3.3 70B: 128,000 tokens

DeepSeek V3: 128,000 tokens

Mistral Large 2: 128,000 tokens
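
The sizes above can be used to check which models can hold a given prompt. The following is a minimal sketch, not part of any provider SDK: the names `CONTEXT_WINDOWS` and `models_that_fit` are hypothetical, and the token counts are assumed to be measured already (each model's tokenizer counts differently, so the same text yields different totals per model).

```python
# Context window sizes in tokens, copied from the list above.
CONTEXT_WINDOWS = {
    "Gemini 1.5 Pro": 2_000_000,
    "Claude 3.5 Sonnet": 200_000,
    "GPT-4o": 128_000,
    "Llama 3.3 70B": 128_000,
    "DeepSeek V3": 128_000,
    "Mistral Large 2": 128_000,
}


def models_that_fit(prompt_tokens: int, reserved_output_tokens: int = 0) -> list[str]:
    """Return the models whose window holds the prompt plus any reserved output."""
    needed = prompt_tokens + reserved_output_tokens
    return [name for name, window in CONTEXT_WINDOWS.items() if window >= needed]


# A 150,000-token prompt with 4,000 tokens reserved for the reply.
print(models_that_fit(150_000, reserved_output_tokens=4_000))
# → ['Gemini 1.5 Pro', 'Claude 3.5 Sonnet']
```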

Context KV Cache Memory Calculator

Filling the context window consumes GPU memory for the Key-Value (KV) cache, which grows linearly with the number of tokens held in context. The figures below estimate KV cache VRAM for a 70B parameter model.

Active Context Length: 15,000 tokens

Estimated KV Cache VRAM (Llama 3 70B, FP16): 2.20 GB
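
The calculator's formula is not shown on this page. As a rough cross-check, the sketch below uses the standard per-token estimate: 2 (key and value) × layers × KV heads × head dimension × bytes per value. The function name `kv_cache_bytes` is hypothetical, and the defaults assume the Llama 3 70B configuration (80 layers, 8 KV heads under grouped-query attention, head dimension 128) with FP16 storage.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int = 80,      # assumed: Llama 3 70B
    num_kv_heads: int = 8,     # assumed: grouped-query attention
    head_dim: int = 128,       # assumed
    bytes_per_value: int = 2,  # FP16
    batch_size: int = 1,
) -> int:
    """Estimate KV cache size in bytes for a given number of cached tokens."""
    # The leading 2 accounts for storing both a key and a value per layer.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens * batch_size


total = kv_cache_bytes(15_000)
print(f"{total / 1e9:.2f} GB ({total / 2**30:.2f} GiB)")
# → 4.92 GB (4.58 GiB)
```

Under these assumptions the estimate for 15,000 tokens is roughly twice the 2.20 GB shown above, which suggests the calculator uses a different formula or different model parameters. Activations, model weights, and framework overhead are not included in either figure.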

1. Model Context Specifications Table

Below is a consolidated list of model specifications and input/output pricing.

2. Understanding Output Token Ceilings

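A model's context window is its total token capacity, shared between the input prompt and the generated output. Models also have a separate output token ceiling that limits how much they can generate in a single request. For example, GPT-4o has a 128k context window, but its original release was capped at 4,096 output tokens per request (later snapshots raised the cap to 16,384).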
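
The interaction between the two limits can be sketched as follows. This is an illustrative helper, not a provider API: `max_output_budget` is a hypothetical name, and it assumes the prompt and the output share one context window, with the GPT-4o example's 4k ceiling taken as 4,096 tokens.

```python
def max_output_budget(context_window: int, output_cap: int, prompt_tokens: int) -> int:
    """Return how many tokens the model can still generate for this prompt."""
    if prompt_tokens > context_window:
        raise ValueError("prompt does not fit in the context window")
    # Output is limited by whichever is smaller: the output ceiling
    # or the space left in the window after the prompt.
    remaining = context_window - prompt_tokens
    return min(output_cap, remaining)


# Short prompt: the output ceiling is the binding limit.
print(max_output_budget(128_000, 4_096, prompt_tokens=10_000))   # → 4096
# Near-full window: the remaining space is the binding limit.
print(max_output_budget(128_000, 4_096, prompt_tokens=126_000))  # → 2000
```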

Frequently Asked Questions

Why are output limits different from context windows?

Autoregressive generation produces one token per forward pass, so long outputs are slow and occupy GPU memory for the full duration of the request. Providers cap output length to keep individual requests from monopolizing GPU resources.

Which model has the largest output limit?

Anthropic's Claude 3.5 Sonnet supports up to 8,192 output tokens per request, while Claude 3 Opus is capped at 4,096 tokens.