QLoRA: Memory-efficient Fine-tuning
LoRA reduces the number of trainable parameters. QLoRA goes further by also shrinking the memory footprint of the frozen base model through quantization – compressing each weight from a 16-bit or 32-bit float down to a 4-bit code (NF4 is a 4-bit data type, not a plain integer format).
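The compression step can be illustrated with a small pure-NumPy sketch of blockwise absmax quantization. The 16 codebook levels below are the published NF4 values; the block size and the idea of one scale per block mirror the scheme bitsandbytes uses, but this is an illustration, not the library's implementation:

```python
import numpy as np

# The 16 NF4 codebook levels (quantiles of a standard normal, rescaled to [-1, 1]).
NF4_LEVELS = np.array([
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
])

def quantize_block(w):
    """Map one block of weights to 4-bit indices plus a single absmax scale."""
    scale = np.abs(w).max() or 1.0
    normed = w / scale                                   # now in [-1, 1]
    idx = np.abs(normed[:, None] - NF4_LEVELS[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_block(idx, scale):
    """Recover approximate weights: codebook lookup times the block scale."""
    return NF4_LEVELS[idx] * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=64).astype(np.float32)      # one 64-weight block
idx, scale = quantize_block(w)
w_hat = dequantize_block(idx, scale)
print("max abs reconstruction error:", np.abs(w - w_hat).max())
```

Each weight is stored as a 4-bit index into the codebook, and only one float scale is kept per block – which is where the roughly 4x memory saving over 16-bit storage comes from.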
How QLoRA Works
A standard LoRA setup keeps the base model in 16-bit precision. For a 7B parameter model, that is still ~14GB of GPU memory just to store the weights. QLoRA solves this by:
- Loading the base model in 4-bit NF4 quantization – reducing the 7B model to ~4GB;
- Keeping the LoRA adapters in 16-bit (bfloat16) – they remain full precision for stable gradient updates;
- Dequantizing weights on the fly to the compute dtype (bfloat16) during the forward pass; the stored 4-bit representation is kept throughout, so nothing needs to be re-quantized afterwards.
The adapters are the only thing updated during training. The quantized base weights are always frozen.
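The division of labour above can be mimicked in a few lines of NumPy: the base weight is stored in reduced precision and cast up to the compute dtype only when used, while the adapter matrices stay in full precision. This is a toy sketch – float16 stands in for 4-bit storage, and the shapes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
d_in, d_out, r, alpha = 16, 16, 8, 16

# Frozen base weight, stored in reduced precision (stand-in for 4-bit NF4).
W = rng.normal(0, 0.02, (d_out, d_in)).astype(np.float16)

# LoRA adapters kept in full precision. B starts at zero, so training
# begins exactly from the base model's behaviour.
A = rng.normal(0, 0.02, (r, d_in)).astype(np.float32)
B = np.zeros((d_out, r), dtype=np.float32)

def qlora_forward(x):
    base = W.astype(np.float32) @ x        # "dequantize" to compute dtype on the fly
    delta = (alpha / r) * (B @ (A @ x))    # trainable adapter path
    return base + delta

x = rng.normal(size=d_in).astype(np.float32)
y = qlora_forward(x)
# With B initialised to zero, the adapter contributes nothing yet,
# so y equals the frozen base model's output.
```

Only A and B would receive gradient updates; W is read-only for the entire training run.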
Implementation with bitsandbytes and PEFT
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4 – best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for on-the-fly dequantized compute
    bnb_4bit_use_double_quant=True          # Nested quantization for extra memory savings
)

model_name = "bigscience/bloom-560m"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Recommended before training on a k-bit model: casts layer norms to
# full precision and enables input gradients for checkpointing
model = prepare_model_for_kbit_training(model)

# Apply LoRA adapters on top of the quantized base model
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
qlora_model = get_peft_model(model, lora_config)
qlora_model.print_trainable_parameters()
Run this locally if you have a CUDA GPU available (bitsandbytes 4-bit loading requires one). The print_trainable_parameters() output shows the count of trainable adapter parameters next to the model's total parameter count – typically well under 1% of the weights.
QLoRA vs. LoRA
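Plugging rough numbers into the comparison: at 7B parameters, bf16 storage costs 2 bytes per weight while 4-bit storage costs half a byte (ignoring the small per-block scales). The adapter shapes below are hypothetical, chosen only to show the order of magnitude:

```python
# Back-of-the-envelope weight memory for a 7B model (illustrative numbers).
params = 7_000_000_000

lora_base_gb = params * 2 / 1e9     # LoRA: base in bf16, 2 bytes per parameter
qlora_base_gb = params * 0.5 / 1e9  # QLoRA: base in 4-bit, 0.5 bytes per parameter

# Adapter cost is tiny either way: r=8 on, say, 32 attention projections
# of size 4096x4096 (hypothetical shapes, not a specific model).
adapter_params = 32 * 2 * (4096 * 8)      # an A and a B matrix per projection
adapter_gb = adapter_params * 2 / 1e9     # adapters kept in bf16

print(f"LoRA base weights:  {lora_base_gb:.1f} GB")
print(f"QLoRA base weights: {qlora_base_gb:.1f} GB")
print(f"LoRA adapters:      {adapter_gb:.3f} GB")
```

The base model dominates the memory budget in both cases, which is why quantizing it – rather than the adapters – is where QLoRA finds its savings. Optimizer state and activations add to these figures, but only for the adapter parameters.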