Lesson 6.4.2: LoRA & QLoRA
LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are parameter-efficient fine-tuning (PEFT) methods that leverage low-rank matrix decomposition to adapt large language models (LLMs) with minimal compute. Here’s a deep dive into their mechanics and advantages.
1. Core Mathematical Idea: Low-Rank Adaptation
Problem with Full Fine-Tuning
- A pretrained LLM has weight matrices $W_0 \in \mathbb{R}^{d \times k}$.
- Full fine-tuning updates all of these parameters, which is computationally and memory expensive (a rough estimate follows).
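As a back-of-envelope illustration (assuming mixed-precision AdamW, which needs roughly 16 bytes of state per parameter: fp16 weights and gradients plus fp32 master weights and the two optimizer moments), fully fine-tuning a 7B-parameter model requires on the order of

$$7 \times 10^9 \ \text{params} \times 16 \ \text{bytes/param} \approx 112\ \text{GB}$$

of GPU memory before activations are even counted, which is why updating every weight is impractical on a single GPU.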
LoRA
- Freeze the original weights $W_0$.
- Introduce low-rank update matrices $A$ and $B$, so the adapted weight is $W = W_0 + \Delta W = W_0 + BA$.
- Where:
  - $W_0 \in \mathbb{R}^{d \times k}$: Original frozen weight matrix
  - $B \in \mathbb{R}^{d \times r}$: Low-rank matrix (trainable)
  - $A \in \mathbb{R}^{r \times k}$: Low-rank matrix (trainable)
  - $r \ll \min(d, k)$: Rank hyperparameter
- Reduces memory footprint: LoRA applies a low-rank approximation to the weight update matrix ΔW, representing it as the product $BA$ of two much smaller matrices, so the number of trainable parameters per layer drops from $d \times k$ to $r(d + k)$ (a from-scratch sketch of a LoRA layer follows this list).
- Fast fine-tuning: with far fewer trainable parameters, gradients and optimizer states are tiny, so LoRA trains faster and with less memory than full fine-tuning.
- Maintains performance: LoRA has been shown to match or come close to full fine-tuning on many downstream tasks.
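To make the mechanics concrete, here is a minimal from-scratch sketch of a LoRA-wrapped linear layer in PyTorch. The class name `LoRALinear`, the rank/alpha values, and the 4096×4096 layer size are illustrative choices, not part of any particular library; the $\alpha / r$ scaling and the zero-initialization of $B$ follow the convention from the LoRA paper.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = x W0^T + (alpha / r) * x A^T B^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze W0 (and bias)

        d, k = base.out_features, base.in_features
        # A: r x k (random init), B: d x r (zero init) so BA = 0 at the start
        self.A = nn.Parameter(torch.empty(r, k))
        self.B = nn.Parameter(torch.zeros(d, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus the low-rank update BA, applied to the same input
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: wrap a 4096x4096 projection; only the low-rank factors are trainable
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 2 * 4096 * 8 = 65,536 vs ~16.8M for full ΔW
```

For this single projection, LoRA with $r = 8$ trains about 65K parameters instead of the ~16.8M that a full update of ΔW would require, roughly 0.4% of the original count.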
QLoRA
- Quantizes the frozen base model: QLoRA takes LoRA a step further by storing the frozen pretrained weights in 4-bit precision (the NF4 data type, with tricks such as double quantization and paged optimizers), while the LoRA adapter matrices stay in higher precision (e.g., bfloat16) and remain the only trainable parameters. Gradients are backpropagated through the quantized weights into the adapters, further reducing the memory footprint and storage requirements (see the configuration sketch below).
- More memory efficient: QLoRA is even more memory efficient than LoRA, making it possible to fine-tune models that would otherwise not fit in resource-constrained environments.
- Similar effectiveness: QLoRA has been shown to match the performance of 16-bit LoRA fine-tuning on a range of tasks, while offering significant memory advantages.
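A typical QLoRA setup is sketched below, assuming the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack; the model id is a placeholder and the rank, alpha, and target-module choices are illustrative defaults rather than prescribed values (exact argument names can also shift between library versions).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters stay in bf16 and are the only trainable parameters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

The resulting model can then be passed to a standard training loop or trainer; at the end, only the small adapter weights need to be saved, while the 4-bit base model is reused as-is.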