QLoRA: Memory-efficient Fine-tuning
LoRA reduces the number of trainable parameters. QLoRA goes further by also shrinking the memory footprint of the frozen base model through quantization – compressing each weight from a 16-bit or 32-bit float down to a 4-bit code (NF4 is a 4-bit data type, not a plain integer format).
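The compression step can be illustrated with a small pure-NumPy sketch of blockwise absmax quantization. The 16 codebook levels below are the published NF4 values; the block size and the idea of one scale per block mirror the scheme bitsandbytes uses, but this is an illustration, not the library's implementation:

```python
import numpy as np

# The 16 NF4 codebook levels (quantiles of a standard normal, rescaled to [-1, 1]).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def quantize_block(w):
    """Map one block of weights to 4-bit indices plus a single absmax scale."""
    scale = np.abs(w).max() or 1.0
    normed = w / scale                                   # now in [-1, 1]
    idx = np.abs(normed[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_block(idx, scale):
    """Recover approximate weights: codebook lookup times the block scale."""
    return NF4_LEVELS[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=64).astype(np.float32)      # one 64-weight block
idx, scale = quantize_block(w)
w_hat = dequantize_block(idx, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Each weight is stored as a 4-bit index into the codebook, and only one float scale is kept per block – which is where the roughly 4x memory saving over 16-bit storage comes from.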
How QLoRA Works
A standard LoRA setup keeps the base model in 16-bit precision. For a 7B parameter model, that is still ~14GB of GPU memory just to store the weights. QLoRA solves this by:
- Loading the base model in 4-bit NF4 quantization – reducing the 7B model to ~4GB;
- Keeping the LoRA adapters in 16-bit (bfloat16) – they remain full precision for stable gradient updates;
- Dequantizing weights on the fly to the compute dtype (bfloat16) during the forward pass; the stored 4-bit representation is kept throughout, so nothing needs to be re-quantized afterwards.
The adapters are the only thing updated during training. The quantized base weights are always frozen.
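The division of labour above can be mimicked in a few lines of NumPy: the base weight is stored in reduced precision and cast up to the compute dtype only when used, while the adapter matrices stay in full precision. This is a toy sketch – float16 stands in for 4-bit storage, and the shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, alpha = 16, 16, 8, 16

# Frozen base weight, stored in reduced precision (stand-in for 4-bit NF4).
W = rng.normal(0, 0.02, (d_out, d_in)).astype(np.float16)

# LoRA adapters kept in full precision. B starts at zero, so training
# begins exactly from the base model's behaviour.
A = rng.normal(0, 0.02, (r, d_in)).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)

def qlora_forward(x):
    base = W.astype(np.float32) @ x        # "dequantize" to compute dtype on the fly
    delta = (alpha / r) * (B @ (A @ x))    # trainable adapter path
    return base + delta

x = rng.normal(size=d_in).astype(np.float32)
y = qlora_forward(x)
# With B initialised to zero, the adapter contributes nothing yet,
# so y equals the frozen base model's output.
```

Only A and B would receive gradient updates; W is read-only for the entire training run.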
Implementation with bitsandbytes and PEFT
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 – best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for on-the-fly dequantized compute
    bnb_4bit_use_double_quant=True          # Nested quantization for extra memory savings
)

model_name = "bigscience/bloom-560m"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Recommended before training on a k-bit model: casts layer norms to
# full precision and enables input gradients for checkpointing
model = prepare_model_for_kbit_training(model)

# Apply LoRA adapters on top of the quantized base model
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
qlora_model = get_peft_model(model, lora_config)
qlora_model.print_trainable_parameters()
Run this locally if you have a CUDA GPU available (bitsandbytes 4-bit loading requires one). The print_trainable_parameters() output shows the count of trainable adapter parameters next to the model's total parameter count – typically well under 1% of the weights.
QLoRA vs. LoRA
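Plugging rough numbers into the comparison: at 7B parameters, bf16 storage costs 2 bytes per weight while 4-bit storage costs half a byte (ignoring the small per-block scales). The adapter shapes below are hypothetical, chosen only to show the order of magnitude:

```python
# Back-of-the-envelope weight memory for a 7B model (illustrative numbers).
params = 7_000_000_000

lora_base_gb = params * 2 / 1e9     # LoRA: base in bf16, 2 bytes per parameter
qlora_base_gb = params * 0.5 / 1e9  # QLoRA: base in 4-bit, 0.5 bytes per parameter

# Adapter cost is tiny either way: r=8 on, say, 32 attention projections
# of size 4096x4096 (hypothetical shapes, not a specific model).
adapter_params = 32 * 2 * (4096 * 8)      # an A and a B matrix per projection
adapter_gb = adapter_params * 2 / 1e9     # adapters kept in bf16

print(f"LoRA base weights:  {lora_base_gb:.1f} GB")
print(f"QLoRA base weights: {qlora_base_gb:.1f} GB")
print(f"LoRA adapters:      {adapter_gb:.3f} GB")
```

The base model dominates the memory budget in both cases, which is why quantizing it – rather than the adapters – is where QLoRA finds its savings. Optimizer state and activations add to these figures, but only for the adapter parameters.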