Fine-tuning and Adapting LLMs

What Is LoRA?


Full fine-tuning updates every parameter in the model. For a 7B parameter model, that means storing and computing gradients for 7 billion values — expensive in memory, time, and storage. Low-Rank Adaptation (LoRA) makes fine-tuning tractable by updating only a tiny fraction of additional parameters while keeping the original weights frozen.

The Core Idea

For each weight matrix W in the model (typically the attention projections), LoRA introduces two small trainable matrices A and B such that:

W' = W + BA

where A ∈ ℝ^(r×d) and B ∈ ℝ^(d×r), with rank r ≪ d. The original W is frozen. Only A and B are updated during training.

At initialization, B is set to zero so that BA = 0, meaning the adapter has no effect at the start of fine-tuning. As training progresses, the adapter learns the task-specific update direction.
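The zero initialization can be checked directly. The NumPy sketch below (an illustration, not the lesson's PyTorch code) builds W, A, and B with the initialization described above and confirms the adapted layer matches the base layer exactly at step zero:

```python
# Minimal sketch: with B = 0, the adapter W' = W + BA is a no-op.
import numpy as np

d, r = 8, 2
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d))         # frozen base weight
A = rng.standard_normal((r, d)) * 0.01  # small random init
B = np.zeros((d, r))                    # zero init, so BA = 0

x = rng.standard_normal(d)
base_out = W @ x
adapted_out = (W + B @ A) @ x           # W' = W + BA

print(np.allclose(base_out, adapted_out))  # True
```

Because only B needs to be zero, A can keep a nonzero random init and gradients still flow to both matrices from the first step.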

Why Low Rank Works

The hypothesis behind LoRA is that the weight updates needed for fine-tuning lie in a low-dimensional subspace of the full parameter space. Instead of updating the full d × d matrix, you approximate the update with two small matrices whose product is low-rank. In practice, r = 4 to r = 16 is often sufficient.
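The rank constraint is a property of the matrix product itself: a d × r matrix times an r × d matrix can never have rank above r. A quick NumPy check (illustrative, using the same d = 512 and r = 4 as the example below):

```python
# The update BA is a full-size d x d matrix, but its rank is capped at r.
import numpy as np

d, r = 512, 4
rng = np.random.default_rng(42)
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))

update = B @ A                          # shape (512, 512)
print(np.linalg.matrix_rank(update))    # 4
```

So LoRA searches only within the rank-r slice of possible weight updates, which is exactly where the low-rank hypothesis says the useful updates live.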

What This Means in Practice

```python
# A linear layer with LoRA applied manually
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, in_features, out_features, rank=4):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(out_features, in_features),
            requires_grad=False
        )  # Frozen base weight
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        base = x @ self.weight.T
        lora = x @ self.lora_A.T @ self.lora_B.T
        return base + lora

layer = LoRALinear(in_features=512, out_features=512, rank=4)
x = torch.rand(2, 10, 512)
print(layer(x).shape)  # Expected: torch.Size([2, 10, 512])
```

Run this locally and count the trainable parameters: rank × in + out × rank for the adapter vs. in × out for the full matrix. With rank=4 and d=512, you train 4 × 512 + 512 × 4 = 4096 parameters instead of 512 × 512 = 262144.
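The arithmetic works out as follows (plain Python, using the d = 512, rank = 4 values from the example above):

```python
# Trainable parameter count: LoRA adapter vs. the full weight matrix.
d, r = 512, 4

lora_params = r * d + d * r   # lora_A (r x d) plus lora_B (d x r)
full_params = d * d           # the full d x d weight matrix

print(lora_params)                 # 4096
print(full_params)                 # 262144
print(lora_params / full_params)   # 0.015625, i.e. about 1.6%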



Section 1. Chapter 4
