Lesson 6.4.1: Quantization
Quantization in LLMs
Quantization reduces the memory and computational requirements of neural networks by converting high-precision numbers (e.g., 32-bit floating point) into lower-precision formats (e.g., 8-bit integers). This is crucial for deploying large language models (LLMs) efficiently.
The main motivation behind quantizing deep neural networks is to improve inference speed and reduce memory usage. Quantization maps a network's floating-point weights (and often its activations) onto a small set of integer values. During quantization, running statistics of the observed value ranges, such as a moving average of the minimum and maximum, are commonly used so that the transition from full precision to 8-bit integers remains stable.
With the advent of large language models (LLMs), the number of parameters continues to grow, resulting in an increasingly large memory footprint. At the same time, there is growing demand to run these models on smaller devices such as laptops, mobile phones, and even smartwatches. Achieving this requires reducing model size and improving efficiency, which is where quantization becomes indispensable.
Symmetric Quantization
Symmetric quantization is straightforward: it maps the range of the data symmetrically around zero. The scale is calculated from the maximum absolute value in the tensor, and the zero point is always set to zero, which makes the method simpler and faster. Asymmetric quantization, by contrast, adjusts the mapping based on the actual range of the data. One significant advantage of symmetric quantization is its simplicity: with the zero point fixed at zero, there is less to compute, which can speed up both training and inference. A short code sketch follows the list below.
- Uses a symmetric range around zero (e.g., [-127, 127] for int8).
- Zero point (Z) is fixed at 0.
- **Scale factor:** S = max(|x|) / 127 for int8.
- **Quantization:** q = clip(round(x / S), -127, 127), with dequantization x ≈ S · q.
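As a concrete illustration, here is a minimal NumPy sketch of symmetric int8 quantization; the function names and the random example tensor are illustrative, not part of any particular library:

```python
import numpy as np

def symmetric_quantize(x: np.ndarray, num_bits: int = 8):
    """Quantize a float tensor to signed integers over a zero-centered range."""
    q_max = 2 ** (num_bits - 1) - 1            # 127 for int8
    scale = np.max(np.abs(x)) / q_max          # S = max(|x|) / 127
    q = np.clip(np.round(x / scale), -q_max, q_max).astype(np.int8)
    return q, scale                            # zero point is implicitly 0

def symmetric_dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integers back to approximate floats."""
    return q.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = symmetric_quantize(weights)
error = np.abs(weights - symmetric_dequantize(q, scale)).max()
print(f"scale={scale:.5f}, max reconstruction error={error:.5f}")
```

Because the zero point is fixed at 0, only the scale has to be stored per tensor, which is part of what makes the symmetric scheme so cheap.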
Asymmetric Quantization
Asymmetric quantization is more flexible. Unlike symmetric quantization, it does not center the range around zero: the zero point can shift, allowing a better representation of the data's range and making it more adaptable to data that is not evenly distributed. The scale is computed from the full span between the minimum and maximum values, and the zero point is adjusted so that the data's minimum maps onto the lowest integer value. This shift helps reduce quantization error, especially for skewed data. The main advantage of asymmetric quantization is precision: by adjusting the zero point, it better captures the range of the data, which can lead to higher accuracy, especially for models with unbalanced value distributions. A code sketch follows the list below.
- Uses an asymmetric range (e.g., [-128, 127] for int8).
- Zero point (Z) is adjusted to match the input distribution.
- Better for skewed data (e.g., ReLU outputs, which are all ≥ 0).
- **Scale factor:** S = (x_max − x_min) / (q_max − q_min), i.e., (x_max − x_min) / 255 for int8.
- **Zero point:** Z = round(q_min − x_min / S).
- **Quantization:** q = clip(round(x / S) + Z, q_min, q_max), with dequantization x ≈ S · (q − Z).
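A matching NumPy sketch for asymmetric int8 quantization (again with illustrative names) shows how the shifted zero point lets skewed data, such as ReLU outputs, use the full integer range:

```python
import numpy as np

def asymmetric_quantize(x: np.ndarray, num_bits: int = 8):
    """Quantize a float tensor to signed integers using a shifted (asymmetric) range."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # -128, 127
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / (q_max - q_min)        # S = (x_max - x_min) / 255
    zero_point = int(round(q_min - x_min / scale))   # Z maps x_min onto q_min
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.int8)
    return q, scale, zero_point

def asymmetric_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Skewed, non-negative data (like ReLU activations) benefits from the shifted range.
activations = np.abs(np.random.randn(4, 4)).astype(np.float32)
q, scale, zp = asymmetric_quantize(activations)
print("zero point:", zp)
print("max error:", np.abs(activations - asymmetric_dequantize(q, scale, zp)).max())
```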
Calibration in Quantization
Calibration is the process of determining optimal quantization parameters (scale factor S and zero point Z) by analyzing the statistical distribution of a neural network's weights and activations using real input data. It bridges the gap between theoretical quantization math and practical deployment.
Why Calibration Matters
- Prevents significant accuracy loss during quantization
- Adapts quantization ranges to actual data distribution
- Critical for activation tensors (which are data-dependent)
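As a rough sketch of what a calibration pass does, the snippet below assumes a simple min/max observer (the class and variable names are hypothetical) that accumulates the range of activations over a few representative batches and then derives a scale and zero point:

```python
import numpy as np

class MinMaxObserver:
    """Tracks the running range of the activations it sees during calibration."""
    def __init__(self):
        self.x_min, self.x_max = np.inf, -np.inf

    def observe(self, x: np.ndarray):
        self.x_min = min(self.x_min, float(x.min()))
        self.x_max = max(self.x_max, float(x.max()))

    def qparams(self, q_min: int = -128, q_max: int = 127):
        """Turn the observed range into a scale and zero point."""
        scale = (self.x_max - self.x_min) / (q_max - q_min)
        zero_point = int(round(q_min - self.x_min / scale))
        return scale, zero_point

observer = MinMaxObserver()

# Feed a few representative batches (random stand-ins here) through the observer,
# just as real calibration data would be fed through the model.
for _ in range(10):
    batch_activations = np.maximum(np.random.randn(32, 64), 0)   # ReLU-like outputs
    observer.observe(batch_activations)

scale, zero_point = observer.qparams()
print(f"calibrated scale={scale:.5f}, zero_point={zero_point}")
```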
Post-Training Quantization (PTQ) vs. Quantization-Aware Training (QAT)
| Post-Training Quantization (PTQ) | Quantization-Aware Training (QAT) |
|---|---|
| Applied to a pre-trained model to reduce model size and improve inference speed without requiring retraining. | Incorporates quantization operations directly into the training process, allowing the model to adapt and optimize its weights for quantized inference. |
| Simpler and faster approach, but may result in a slight accuracy drop. | Requires longer training time, but can achieve better accuracy/efficiency trade-offs. |
| Typically uses static quantization, where quantization parameters are fixed after calibration (dynamic variants compute activation parameters at inference time). | Learns quantization parameters alongside the weights and allows fine-grained control over the quantization process, enabling optimization for specific hardware targets or performance requirements. |
Steps Involved: Post-Training Quantization (PTQ)
- Model Analysis: Profile weight/activation distributions (min-max statistics, histograms).
- Calibration: Feed representative data through the model to observe dynamic ranges.
- Quantization: Apply the scale/zero point to convert weights/activations to integers.
- Validation: Check the accuracy drop on a test set (see the sketch after this list).
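These steps map onto PyTorch's eager-mode post-training static quantization workflow. The sketch below is illustrative rather than definitive: the toy model, the calibration batches of random data, and the choice of the "fbgemm" (x86) backend are assumptions, and a real pipeline would validate accuracy on an actual test set.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Toy model with quant/dequant stubs marking the int8 region."""
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc1 = nn.Linear(128, 64)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(64, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return self.dequant(x)

model = TinyNet().eval()
model.qconfig = torch.quantization.get_default_qconfig("fbgemm")  # assumed x86 backend
prepared = torch.quantization.prepare(model)   # model analysis: insert range observers

# Calibration: run representative data so the observers record dynamic ranges.
for _ in range(32):
    prepared(torch.randn(16, 128))

# Quantization: convert the observed modules to int8 weights/activations.
quantized = torch.quantization.convert(prepared)

# Validation: compare against the fp32 model (here only on random inputs).
x = torch.randn(4, 128)
print((model(x) - quantized(x)).abs().max())
```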
Steps Involved: Quantization-Aware Training (QAT)
- Fake Quantization: Inject simulated quantization ops (rounding/clipping) into the forward pass.
- Fine-Tuning: Retrain the model with these simulated quantized weights/activations.
- Final Quantization: Replace the fake-quantization ops with true low-precision operations after training (see the sketch after this list).
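To make the fake-quantization step concrete, here is a small PyTorch sketch of a fake-quantize function that uses a straight-through estimator so gradients can flow through the rounding during fine-tuning; the function name and the symmetric 8-bit settings are assumptions for illustration.

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate int8 quantization in the forward pass while keeping fp32 storage.

    The straight-through estimator (the .detach() trick) lets gradients pass
    through the non-differentiable round/clamp during fine-tuning.
    """
    q_max = 2 ** (num_bits - 1) - 1
    scale = x.abs().max() / q_max                      # symmetric per-tensor scale
    q = torch.clamp(torch.round(x / scale), -q_max, q_max)
    x_dq = q * scale                                   # dequantized value used downstream
    return x + (x_dq - x).detach()                     # forward: x_dq, backward: identity

# During QAT fine-tuning, weights (and activations) pass through fake_quantize in the
# forward pass, so the training loss "sees" quantization error and adapts to it.
w = torch.randn(64, 128, requires_grad=True)
loss = fake_quantize(w).pow(2).sum()
loss.backward()
print(w.grad.shape)    # gradients flow despite the rounding
```

After fine-tuning, the simulated ops are replaced with genuine low-precision kernels, which is the final quantization step above.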