LLM Quantization: Compression Formats and Accuracy Tradeoffs

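Large language models are typically trained and distributed in 16-bit floating-point formats (FP16 or BF16). At 2 bytes per parameter, the weights alone require substantial VRAM. Quantization compresses these weights to lower precision so the model fits in less memory. This article covers how the process works, the common file formats, and the accuracy cost.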

1. What Is Quantization?

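Quantization maps high-precision floating-point weights (such as 16-bit floats) onto a small set of discrete levels (such as 8-bit or 4-bit integers), usually alongside a stored scale factor so that approximate values can be recovered at inference time. Relative to 16-bit weights, 8-bit storage cuts the weight file size by about 50% and 4-bit storage by about 75%.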
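The mapping described above can be sketched in a few lines of plain Python. This is a minimal illustration of symmetric "absmax" quantization with a single scale for the whole list, not the exact scheme used by any particular format; real formats typically quantize small groups of weights with one scale per group. The function names and sample weights are made up for the example.

```python
def quantize(weights, bits):
    """Symmetric absmax quantization: map floats to signed integer codes."""
    qmax = 2 ** (bits - 1) - 1            # 127 for 8-bit, 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest grid point and clamp to the representable range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float values from the integer codes."""
    return [c * scale for c in codes]


def max_error(weights, bits):
    """Largest absolute rounding error after a quantize/dequantize round trip."""
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    return max(abs(w - r) for w, r in zip(weights, restored))


# Illustrative weights only; real tensors hold millions of values.
weights = [0.42, -1.30, 0.07, 0.88, -0.51]

for bits in (8, 4):
    codes, scale = quantize(weights, bits)
    print(f"{bits}-bit codes={codes} max_error={max_error(weights, bits):.4f}")
```

Running the loop shows the rounding error growing as the bit width shrinks, because fewer grid points are available to represent the same range of values.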

2. Quantization Formats: GGUF vs. GPTQ vs. EXL2

Different formats serve different backends:

- **GGUF**: Best for CPU/GPU hybrid execution via llama.cpp.
- **GPTQ**: Best for pure Nvidia GPU execution.
- **EXL2**: Optimized for fast generation speeds on Nvidia GPUs.

3. The Accuracy Tradeoff (Perplexity Surcharge)

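Compressing model weights introduces rounding errors that can degrade output quality. The usual metric is perplexity, where lower is better, so quantization loss shows up as an increase in perplexity over the FP16 baseline. 8-bit quantization (Q8) is nearly indistinguishable from FP16 and 4-bit quantization (Q4) typically adds only a small increase, while 2-bit quantization (Q2) exhibits noticeable quality degradation.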
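To make the metric concrete, perplexity is the exponential of the average negative log-probability a model assigns to each correct next token. The sketch below computes it from a list of per-token probabilities. The probability values are invented for illustration and are not measurements from any real model.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-probability of the correct tokens)."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)


# Hypothetical probabilities assigned to the correct next token on the same text.
fp16_probs = [0.50, 0.25, 0.40, 0.10]
q2_probs = [0.40, 0.20, 0.30, 0.05]   # assumed: a heavily quantized model is less confident

print(f"FP16 perplexity: {perplexity(fp16_probs):.2f}")
print(f"Q2 perplexity:   {perplexity(q2_probs):.2f}")
```

A model that always assigned probability 0.25 to the correct token would score a perplexity of 4, which is why the number is often read as the effective number of equally likely choices per token.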

Frequently Asked Questions

Does quantization degrade model performance?

Yes, but how much depends on the bit width. Q4 quantization cuts weight memory by roughly 70% relative to FP16 with negligible accuracy loss for most uses. Q2 quantization reduces file sizes further but results in noticeable degradation.
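The savings can be estimated with simple arithmetic: weight memory is the parameter count times the bits per weight, divided by 8 to get bytes. The sketch below assumes a hypothetical 7-billion-parameter model and about 4.5 bits per weight for a 4-bit format (4-bit codes plus one 16-bit scale shared by each group of 32 weights). Both figures are assumptions for illustration, and the estimate covers weights only, not the KV cache or activations.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 7e9                               # assumed 7B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)     # 2 bytes per weight
q4_gb = weight_memory_gb(n_params, 4.5)      # assumed 4 bits + per-group scale overhead
reduction = 1 - q4_gb / fp16_gb

print(f"FP16: {fp16_gb:.2f} GB, Q4: {q4_gb:.2f} GB, reduction: {reduction:.0%}")
```

The scale overhead is why the practical saving lands nearer 70% than the nominal 75% implied by going from 16 bits to 4.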

What is GGUF?

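GGUF is a binary file format from the llama.cpp project that stores a model's weights and metadata (such as tokenizer data and hyperparameters) in a single file. It is designed for fast loading and for execution on CPUs, with optional GPU offload.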