RAM vs. VRAM for AI: Bandwidth Bottlenecks Explained

When running models locally, the main hardware decision is where the weights live: GPU VRAM or system RAM. System RAM is inexpensive and available in large capacities, so it can hold very large models, but its bandwidth is low. GPU VRAM costs far more per gigabyte but offers much higher bandwidth. Let's explore why that bandwidth gap is the bottleneck.

1. Memory Bandwidth: The Core Bottleneck

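During text generation, the system must read essentially all of the model's weights from memory for every single token it produces. For a 70B-parameter model quantized to about 4 bits per weight, that is roughly 35GB of data per token (at 16-bit precision it would be about 140GB). Generation speed is therefore capped by memory bandwidth: tokens per second can be at most the bandwidth divided by the model size.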
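The arithmetic behind the ~35GB figure can be sketched in a few lines of Python. This is a rough estimate, not a benchmark: it assumes 4 bits per weight (which is what reproduces 35GB for 70B parameters), counts 1GB as 10^9 bytes, and ignores the KV cache, compute time, and any overhead. The function names are illustrative.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billions * bits_per_weight / 8


def max_tokens_per_second(model_gb: float, bandwidth_gb_s: float) -> float:
    """Upper bound on generation speed if every weight is read once per token."""
    return bandwidth_gb_s / model_gb


size = model_size_gb(70, 4)                 # 70B parameters, 4 bits each (assumption)
ceiling = max_tokens_per_second(size, 70)   # 70 GB/s is an assumed example bandwidth
print(f"{size:.1f} GB per token, at most {ceiling:.1f} tokens/sec")
```

Real-world throughput will sit below this ceiling, but the ratio is a useful first check on whether a given memory system can plausibly hit a target speed.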

2. Bandwidth Comparison: DDR5 vs. GDDR6 vs. HBM

Memory transfer speeds vary by hardware:

- **System RAM (DDR5)**: 60-80GB/s bandwidth.
- **GPU VRAM (GDDR6)**: 500-1000GB/s bandwidth.
- **Enterprise GPU (HBM3)**: Up to 3.35TB/s bandwidth.

3. Generation Speeds (Tokens Per Second)

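Running Llama 3 70B (about 35GB of weights) from DDR5 system RAM yields slow performance, roughly 1-3 tokens/sec. That matches the bandwidth limit: 60-80GB/s divided by 35GB is about 2 tokens/sec. Running the same model entirely from GPU VRAM yields much faster performance, roughly 15-30 tokens/sec, in line with 500-1000GB/s divided by 35GB.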
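To see how the quoted speeds relate to the bandwidth figures, the sketch below divides each bandwidth range from this article by an assumed 35GB model size. The results are theoretical ceilings only; the dictionary and function names are made up for illustration, and measured speeds depend on the runtime, quantization format, and hardware.

```python
MODEL_GB = 35.0  # assumed: 70B model at roughly 4 bits per weight

# (low, high) bandwidth in GB/s, taken from the figures quoted in this article
BANDWIDTH_GB_S = {
    "System RAM (DDR5)": (60, 80),
    "GPU VRAM (GDDR6)": (500, 1000),
    "Enterprise GPU (HBM3)": (3350, 3350),
}


def ceiling_range(model_gb: float, low: float, high: float) -> tuple[float, float]:
    """Theoretical tokens/sec range if all weights are read once per token."""
    return low / model_gb, high / model_gb


for name, (low, high) in BANDWIDTH_GB_S.items():
    lo_tps, hi_tps = ceiling_range(MODEL_GB, low, high)
    print(f"{name}: {lo_tps:.1f} to {hi_tps:.1f} tokens/sec (ceiling)")
```

The DDR5 and GDDR6 ceilings land close to the 1-3 and 15-30 tokens/sec ranges above, which is what you would expect if memory bandwidth is the dominant limit.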

Frequently Asked Questions

Can I use a mixture of RAM and VRAM?

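Yes. Tools like llama.cpp and Ollama support partial offloading, letting you load as much of the model's weights as fits onto VRAM and keep the rest in system RAM. The portion left in system RAM is still read at system-RAM bandwidth, so it tends to dominate the time spent per token.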
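A simple way to reason about a RAM/VRAM split is to add up the time needed to read each portion of the weights. The sketch below assumes the two portions are read one after the other for every token, with no overlap and no transfer overhead. That is a simplification and not a description of how llama.cpp or Ollama schedule work internally. The function name and the example bandwidths (1000GB/s VRAM, 70GB/s RAM) are assumptions.

```python
def hybrid_tokens_per_second(
    model_gb: float, gpu_fraction: float, vram_gb_s: float, ram_gb_s: float
) -> float:
    """Ceiling on tokens/sec when gpu_fraction of the weights sits in VRAM."""
    if not 0.0 <= gpu_fraction <= 1.0:
        raise ValueError("gpu_fraction must be between 0 and 1")
    gpu_seconds = model_gb * gpu_fraction / vram_gb_s        # time to read VRAM part
    cpu_seconds = model_gb * (1.0 - gpu_fraction) / ram_gb_s  # time to read RAM part
    return 1.0 / (gpu_seconds + cpu_seconds)


for fraction in (0.0, 0.5, 0.9, 1.0):
    tps = hybrid_tokens_per_second(35.0, fraction, vram_gb_s=1000, ram_gb_s=70)
    print(f"{fraction:.0%} of weights in VRAM: about {tps:.1f} tokens/sec")
```

Under these assumptions, putting half of a 35GB model in VRAM gives a ceiling of only about 3.7 tokens/sec, far less than half the all-VRAM figure, because the slow RAM portion accounts for most of the per-token time.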

Is DDR5 fast enough for local AI?

DDR5 is usable for small models (around 8B parameters), but larger models (70B) will run slowly, at a few tokens per second at best, because of memory bandwidth constraints.