1. Memory Bandwidth: The Core Bottleneck
During model execution, the system must load parameters from memory for every single token generated. A 70B parameter model requires transferring ~35GB of data per token. The generation speed is bottlenecked by the system's memory transfer bandwidth.
2. Bandwidth Comparison: DDR5 vs. GDDR6 vs. HBM
Memory transfer speeds vary by hardware: - **System RAM (DDR5)**: 60-80GB/s bandwidth. - **GPU VRAM (GDDR6)**: 500-1000GB/s bandwidth. - **Enterprise GPU (HBM3)**: Up to 3.35TB/s bandwidth.
3. Generation Speeds (Tokens Per Second)
Running Llama 3 70B on system RAM DDR5 yields slow performance (1-3 tokens/sec) due to memory bandwidth limits. Running the same model on GPU VRAM yields much faster performance (15-30 tokens/sec).