
How Much RAM for Local LLMs?

An engineering guide on quantization levels (GGUF/EXL2), KV Cache memory allocation, GPU VRAM requirements, and hardware configuration tiers.


1. The Hardware Bottleneck: VRAM vs. System RAM

When self-hosting open-weight models (like Llama 3 or Mistral), output speed is dictated largely by **memory bandwidth**. During inference, the GPU must read billions of parameters from memory to produce each token. Dedicated graphics memory (VRAM) offers very high bandwidth: an Nvidia RTX 4090 moves data at 1,008 GB/sec. Standard desktop system RAM, by contrast, typically delivers only 50 to 90 GB/sec.

If a model fits entirely inside VRAM, generation is fast (typically 30 to 80 tokens per second on a high-end consumer GPU, depending on model size). If the model exceeds the VRAM ceiling and spills into system RAM, the overflow is either streamed to the GPU over the much slower PCIe bus or computed on the CPU straight from system RAM. Either way, output speed can drop to 1-3 tokens per second, which is too slow for real-time applications.
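The bandwidth argument above can be turned into a rough ceiling: if every weight is read once per generated token, decode speed cannot exceed memory bandwidth divided by model size. The sketch below is a back-of-the-envelope estimate only; the function name `max_tokens_per_second` is hypothetical, and the 3.5 GB and 35 GB figures assume nominal 4-bit weights for 7B and 70B models. Real throughput is lower because of compute, KV cache reads, and runtime overhead.

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Upper bound on decode speed, assuming all weights are read once per token."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_per_s / model_size_gb


# Assumed scenarios: nominal 4-bit weights (0.5 bytes per parameter).
scenarios = [
    ("7B 4-bit in RTX 4090 VRAM", 3.5, 1008.0),
    ("70B 4-bit in system RAM (fast)", 35.0, 90.0),
    ("70B 4-bit in system RAM (slow)", 35.0, 50.0),
]

for label, size_gb, bandwidth in scenarios:
    ceiling = max_tokens_per_second(size_gb, bandwidth)
    print(f"{label}: at most {ceiling:.1f} tokens/s")
```

Under these assumptions the system RAM cases land in the low single digits, which is consistent with the 1-3 tokens per second quoted above.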

2. The Math of Model Compression: Quantization

Most open-weight models are released at 16-bit precision (FP16 or BF16), which takes 2 bytes of memory per parameter. A 7B parameter model at 16-bit requires 14 GB of memory just to load the weights. A 70B parameter model requires 140 GB, putting it far out of reach for consumer GPUs.

Quantization addresses this by converting 16-bit float values into lower-bit representations (such as 4-bit or 5-bit integers), typically by storing small blocks of weights as integers plus a shared scale factor. This reduces the memory footprint dramatically:

  • 16-bit (FP16): 2.0 bytes per parameter. (7B model = 14.0 GB weights)
  • 8-bit (Q8_0): 1.0 byte per parameter. (7B model = 7.0 GB weights)
  • 4-bit (Q4_K_M): ~0.5 bytes per parameter. (7B model = 3.5 GB weights)
  • 2-bit (Q2_K): ~0.25 bytes per parameter. (7B model = 1.75 GB weights)

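These are nominal figures. Real quantized files run somewhat larger, because each block also stores scale factors and some tensors are kept at higher precision; a 7B Q4_K_M file is typically a little over 4 GB rather than 3.5 GB.

Quantization shrinks the model on disk and in memory, but introduces rounding error, usually measured as an increase in perplexity. In practice, 4-bit quantization is widely regarded as the sweet spot: it cuts size by roughly 70-75% relative to FP16, with a quality loss that is hard to notice in conversation.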
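The arithmetic in the list above is simply parameter count times bytes per parameter. A minimal sketch, assuming the nominal bytes-per-parameter values from the list (the table name `BYTES_PER_PARAM` and the function `weight_memory_gb` are illustrative, and GB here means 10^9 bytes):

```python
# Nominal bytes per parameter, taken from the list above.
BYTES_PER_PARAM = {
    "FP16": 2.0,
    "Q8_0": 1.0,
    "Q4_K_M": 0.5,
    "Q2_K": 0.25,
}


def weight_memory_gb(params_billions: float, quant: str) -> float:
    """Approximate memory needed for the weights alone, in GB."""
    if quant not in BYTES_PER_PARAM:
        raise KeyError(f"unknown quantization level: {quant}")
    return params_billions * BYTES_PER_PARAM[quant]


for quant in BYTES_PER_PARAM:
    print(f"7B at {quant}: {weight_memory_gb(7, quant):.2f} GB")
    print(f"70B at {quant}: {weight_memory_gb(70, quant):.2f} GB")
```

Treat the results as a lower bound when planning: they exclude the KV cache and any runtime overhead.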

3. The KV Cache Memory Formula

In addition to model weights, inference runtimes must store Key-Value (KV) tensors for every token in the context window. This memory buffer is called the KV cache. It lets the model reuse the attention keys and values of earlier tokens instead of recomputing the entire conversation history for every new token.

The memory size of the KV cache scales linearly with context length and batch size:

KV_VRAM (GB) ≈ 2 * Layers * KV_Heads * (HiddenSize / AttentionHeads) * Context_Length * Batch_Size * 2 bytes (FP16) / 1024^3
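The formula above can be sketched directly in code. The leading factor of 2 accounts for storing both keys and values, and HiddenSize / AttentionHeads is the per-head dimension. The function name `kv_cache_gib` is hypothetical, and the architecture numbers in the example (80 layers, 8 KV heads, hidden size 8192, 64 attention heads for a Llama 3 70B-class model; 32 layers, 8 KV heads, hidden size 4096, 32 attention heads for an 8B-class model) are assumptions that should be checked against the config of the model you actually run.

```python
def kv_cache_gib(
    layers: int,
    kv_heads: int,
    hidden_size: int,
    attention_heads: int,
    context_length: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes for an FP16 cache
) -> float:
    """KV cache size in GiB (1024^3 bytes), following the formula above."""
    head_dim = hidden_size // attention_heads
    total_bytes = (
        2  # one tensor for keys, one for values
        * layers
        * kv_heads
        * head_dim
        * context_length
        * batch_size
        * bytes_per_value
    )
    return total_bytes / 1024**3


# Assumed 70B-class GQA configuration at 8,192 tokens, batch size 1.
print(kv_cache_gib(80, 8, 8192, 64, context_length=8192))

# Assumed 8B-class GQA configuration at the same context length.
print(kv_cache_gib(32, 8, 4096, 32, context_length=8192))
```

Doubling either the context length or the batch size doubles the result, which is the linear scaling described above.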

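For modern Grouped-Query Attention (GQA) models, the per-token cost is easiest to read straight off the formula. Llama 3 8B (32 layers, 8 KV heads, head dimension 128) needs about 128 KiB of FP16 cache per token; Llama 3 70B (80 layers, 8 KV heads, head dimension 128) needs about 320 KiB per token. Because cache size depends on layer count and KV heads rather than on parameter count alone, a single constant per billion parameters is not a reliable shortcut.

At 8,192 context length and batch size 1, a 70B model's KV cache consumes about 2.5 GB of memory. On a multi-user server at batch size 16 with a 32,000-token context limit, the cache can grow to roughly 156 GB if every sequence fills its window, several times the size of the 4-bit weights themselves. Developers must therefore cap context length or concurrency and size their graphics hardware accordingly.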

4. Recommended Hardware Configuration Tiers

The right hardware configuration depends on which model sizes you plan to run:

Budget / Entry Tier

Run 7B / 8B models (4-bit). Requires 8GB VRAM.
GPU: Nvidia RTX 4060 8GB / RTX 3060 12GB. Standard 16GB System RAM.

Developer / Pro Tier

Run 13B / 32B models (4-bit). Requires 16GB-24GB VRAM.
GPU: Nvidia RTX 4090 24GB / RTX 3090 24GB / Mac Studio 32GB Unified Memory.

Workstation / Server Tier

Run 70B+ models (4-bit). Requires 48GB+ VRAM.
GPU: Dual RTX 3090/4090 (48GB VRAM) / Mac Studio 64GB or 128GB Unified Memory.
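As a rough illustration of how the tiers above follow from the earlier arithmetic, the sketch below adds nominal 4-bit weights, a KV cache allowance, and a runtime overhead allowance, then picks the smallest tier whose VRAM covers the total. The function name `recommend_tier`, the 1 GB KV cache default, and the 1 GB overhead default are all assumptions chosen for illustration, not measured values.

```python
from typing import Tuple

# (tier name, VRAM in GB) using the upper figure quoted for each tier above.
TIERS = [
    ("Budget / Entry", 8.0),
    ("Developer / Pro", 24.0),
    ("Workstation / Server", 48.0),
]


def recommend_tier(
    params_billions: float,
    bytes_per_param: float = 0.5,  # nominal 4-bit quantization
    kv_cache_gb: float = 1.0,      # assumed allowance, varies with context length
    overhead_gb: float = 1.0,      # assumed allowance for runtime buffers
) -> Tuple[str, float]:
    """Return the smallest tier that fits, plus the estimated requirement in GB."""
    required_gb = params_billions * bytes_per_param + kv_cache_gb + overhead_gb
    for name, vram_gb in TIERS:
        if required_gb <= vram_gb:
            return name, required_gb
    return "Beyond a 48 GB configuration", required_gb


for size in (7, 13, 32, 70):
    tier, required = recommend_tier(size)
    print(f"{size}B at 4-bit: about {required:.1f} GB -> {tier}")
```

Raising `kv_cache_gb` for long contexts or multi-user serving can push a model into the next tier, so it is worth re-running the estimate with your own numbers.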


Frequently Asked Questions

What happens if a model size exceeds my GPU's VRAM?

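Popular runners like Ollama or llama.cpp keep as many layers as fit on the GPU and run the remaining layers on the CPU from system RAM. The model will still run, but system RAM bandwidth is often 10x to 20x lower than a high-end GPU's VRAM (50 to 90 GB/sec against roughly 1,000 GB/sec), so output speed slows sharply, often to below 2 tokens/sec when most of the model sits in system RAM.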
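To see why even a partial spill hurts, the time per token can be approximated as the time to read the GPU-resident weights from VRAM plus the time to read the remainder from system RAM. This is a simplified sketch: the function name `split_tokens_per_second` is hypothetical, the default bandwidths (1,008 GB/sec and 70 GB/sec) are assumptions, and the model ignores compute time, the KV cache, and transfer overhead, so real speeds will be lower.

```python
def split_tokens_per_second(
    model_gb: float,
    vram_gb: float,
    gpu_bandwidth: float = 1008.0,  # GB/sec, assumed high-end GPU
    cpu_bandwidth: float = 70.0,    # GB/sec, assumed desktop system RAM
) -> float:
    """Bandwidth-only ceiling on tokens/sec when weights are split across VRAM and RAM."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    gpu_part = min(model_gb, max(vram_gb, 0.0))  # weights that fit in VRAM
    cpu_part = model_gb - gpu_part               # weights left in system RAM
    seconds_per_token = gpu_part / gpu_bandwidth + cpu_part / cpu_bandwidth
    return 1.0 / seconds_per_token


# Assumed example: a 35 GB model (70B at nominal 4-bit) with varying VRAM.
for vram in (0, 24, 48):
    rate = split_tokens_per_second(35.0, vram)
    print(f"{vram} GB VRAM: at most {rate:.1f} tokens/s")
```

The slow portion dominates: in this model, leaving about a third of the weights in system RAM removes most of the speed advantage of the GPU.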

Which quantization level is the best sweet spot?

4-bit quantization (specifically the Q4_K_M GGUF format) is the most common default among local LLM users. It reduces model file sizes by roughly 70% relative to FP16 with little measurable loss in reasoning quality, which lets consumer GPUs host models that would otherwise not fit.

Why are Apple Silicon Macs so popular for local LLMs?

Macs use unified memory, meaning the CPU and integrated GPU share the same high-speed RAM. A Mac Studio with 128GB of Unified Memory can run a massive 70B parameter model at 8-bit precision, a feat that would require multiple expensive Nvidia GPUs on a PC.

How does batch size affect local VRAM requirements?

Batch size multiplies the KV cache memory footprint. For single-user local prototyping (batch size 1) at moderate context lengths, the cache is modest: around 1 GB for an 8B model at 8K context, and a few GB for a 70B model. For multi-user servers, high batch sizes (e.g. 16 or 32) multiply that figure and can add tens of GB of VRAM.