QLoRA: Memory-efficient Fine-tuning
LoRA reduces the number of trainable parameters. QLoRA goes further by also reducing the memory footprint of the frozen base model through quantization – compressing weight values from 16- or 32-bit floats down to 4 bits per weight.
How QLoRA Works
A standard LoRA setup keeps the base model in 16-bit precision. For a 7B parameter model, that is still ~14GB of GPU memory just to store the weights. QLoRA solves this by:
- Loading the base model in 4-bit NF4 quantization – reducing the 7B model to ~4GB;
- Keeping the LoRA adapters in 16-bit (bfloat16) – they remain full precision for stable gradient updates;
- Dequantizing weights on-the-fly during the forward pass; the dequantized values are used for computation and then discarded, so only the 4-bit weights are ever stored.
The adapters are the only thing updated during training. The quantized base weights are always frozen.
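The core storage trick is block-wise quantization: weights are split into small blocks, and each block stores one full-precision scale (its absolute maximum) plus a 4-bit code per weight. The sketch below illustrates the idea with uniform 4-bit levels in plain NumPy; real NF4 instead maps codes through a 16-entry codebook derived from a normal distribution, but the round-trip structure is the same.

```python
import numpy as np

def quantize_blockwise_4bit(weights, block_size=64):
    # Split into blocks; each block keeps one fp32 scale (absmax) plus
    # a 4-bit integer code per weight. Illustrative uniform levels only:
    # real NF4 uses a normal-distribution-derived 16-value codebook.
    blocks = weights.ravel().reshape(-1, block_size)
    scales = np.abs(blocks).max(axis=1, keepdims=True)   # absmax per block
    codes = np.round(blocks / scales * 7).astype(np.int8)  # levels -7..7
    return codes, scales

def dequantize_blockwise_4bit(codes, scales, shape):
    # On-the-fly reconstruction used during the forward pass; the result
    # is consumed and discarded, the stored weights stay 4-bit.
    return (codes.astype(np.float32) / 7 * scales).reshape(shape)

w = np.random.randn(4, 64).astype(np.float32)
codes, scales = quantize_blockwise_4bit(w)
w_hat = dequantize_blockwise_4bit(codes, scales, w.shape)
print("max reconstruction error:", np.abs(w - w_hat).max())
```

Storing one scale per 64-weight block is also why "double quantization" helps: quantizing those scales themselves saves a further fraction of a bit per weight.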
Implementation with bitsandbytes and PEFT
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # NormalFloat4 – best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,        # Nested quantization for extra memory savings
)

model_name = "bigscience/bloom-560m"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply LoRA adapters on top of the quantized base model
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],    # BLOOM's fused attention projection
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)
qlora_model = get_peft_model(model, lora_config)
qlora_model.print_trainable_parameters()
```
Run this locally if you have a CUDA GPU available (bitsandbytes requires one). The print_trainable_parameters() output reports the number of trainable adapter parameters against the model's total parameter count, typically well under 1%.
QLoRA vs. LoRA
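The headline difference is the memory needed for the frozen base weights. A back-of-envelope sketch (weights only; it ignores activations, optimizer state, and the small overhead of quantization constants, which is why the text above quotes ~4GB rather than 3.5GB):

```python
def base_model_memory_gb(n_params: float, bits_per_weight: int) -> float:
    # bytes = params * bits / 8; "GB" here means 10^9 bytes
    return n_params * bits_per_weight / 8 / 1e9

n = 7e9  # 7B-parameter model
print(f"LoRA  (bf16 base): {base_model_memory_gb(n, 16):.1f} GB")  # 14.0 GB
print(f"QLoRA (NF4 base):  {base_model_memory_gb(n, 4):.1f} GB")   # 3.5 GB
```

In both setups the adapters themselves add only megabytes, so quantizing the base model dominates the savings; the trade-off is the extra dequantization work in each forward pass, which makes QLoRA somewhat slower per step than plain LoRA.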