Lesson 6.4.2: LoRA & QLoRA
LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA) are parameter-efficient fine-tuning (PEFT) methods that leverage low-rank matrix decomposition to adapt large language models (LLMs) with minimal compute. Here’s a deep dive into their mechanics and advantages.
1. Core Mathematical Idea: Low-Rank Adaptation
Problem with Full Fine-Tuning
- A pretrained LLM has weight matrices $W_0 \in \mathbb{R}^{d \times k}$.
- Full fine-tuning updates all of these parameters, which is computationally and memory expensive (a rough estimate follows).
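As a back-of-envelope illustration (assuming mixed-precision AdamW, which needs roughly 16 bytes of state per parameter: fp16 weights and gradients plus fp32 master weights and the two optimizer moments), fully fine-tuning a 7B-parameter model requires on the order of

$$7 \times 10^9 \ \text{params} \times 16 \ \text{bytes/param} \approx 112\ \text{GB}$$

of GPU memory before activations are even counted, which is why updating every weight is impractical on a single GPU.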
LoRA
- Freeze the original weights $W_0$.
- Introduce low-rank update matrices $A$ and $B$, so the adapted weight is $W = W_0 + \Delta W = W_0 + BA$.
- Where:
  - $W_0 \in \mathbb{R}^{d \times k}$: Original frozen weight matrix
  - $B \in \mathbb{R}^{d \times r}$: Low-rank matrix (trainable)
  - $A \in \mathbb{R}^{r \times k}$: Low-rank matrix (trainable)
  - $r \ll \min(d, k)$: Rank hyperparameter
- Reduces memory footprint: LoRA applies a low-rank approximation to the weight update matrix ΔW, representing it as the product $BA$ of two much smaller matrices, so the number of trainable parameters per layer drops from $d \times k$ to $r(d + k)$ (a from-scratch sketch of a LoRA layer follows this list).
- Fast fine-tuning: with far fewer trainable parameters, gradients and optimizer states are tiny, so LoRA trains faster and with less memory than full fine-tuning.
- Maintains performance: LoRA has been shown to match or come close to full fine-tuning on many downstream tasks.
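To make the mechanics concrete, here is a minimal from-scratch sketch of a LoRA-wrapped linear layer in PyTorch. The class name `LoRALinear`, the rank/alpha values, and the 4096×4096 layer size are illustrative choices, not part of any particular library; the $\alpha / r$ scaling and the zero-initialization of $B$ follow the convention from the LoRA paper.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper: y = x W0^T + (alpha / r) * x A^T B^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False           # freeze W0 (and bias)

        d, k = base.out_features, base.in_features
        # A: r x k (random init), B: d x r (zero init) so BA = 0 at the start
        self.A = nn.Parameter(torch.empty(r, k))
        self.B = nn.Parameter(torch.zeros(d, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # frozen path plus the low-rank update BA, applied to the same input
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# Example: wrap a 4096x4096 projection; only the low-rank factors are trainable
layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=8, alpha=16)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")  # 2 * 4096 * 8 = 65,536 vs ~16.8M for full ΔW
```

For this single projection, LoRA with $r = 8$ trains about 65K parameters instead of the ~16.8M that a full update of ΔW would require, roughly 0.4% of the original count.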
QLoRA
- Quantizes the frozen base model: QLoRA takes LoRA a step further by storing the frozen pretrained weights in 4-bit precision (the NF4 data type, with tricks such as double quantization and paged optimizers), while the LoRA adapter matrices stay in higher precision (e.g., bfloat16) and remain the only trainable parameters. Gradients are backpropagated through the quantized weights into the adapters, further reducing the memory footprint and storage requirements (see the configuration sketch below).
- More memory efficient: QLoRA is even more memory efficient than LoRA, making it possible to fine-tune models that would otherwise not fit in resource-constrained environments.
- Similar effectiveness: QLoRA has been shown to match the performance of 16-bit LoRA fine-tuning on a range of tasks, while offering significant memory advantages.
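A typical QLoRA setup is sketched below, assuming the Hugging Face `transformers`, `peft`, and `bitsandbytes` stack; the model id is a placeholder and the rank, alpha, and target-module choices are illustrative defaults rather than prescribed values (exact argument names can also shift between library versions).

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base weights (the QLoRA recipe)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 data type
    bnb_4bit_use_double_quant=True,        # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.bfloat16, # dequantize to bf16 for matmuls
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",            # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)

# LoRA adapters stay in bf16 and are the only trainable parameters
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total params
```

The resulting model can then be passed to a standard training loop or trainer; at the end, only the small adapter weights need to be saved, while the 4-bit base model is reused as-is.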