Fine-tuning and Adapting LLMs

QLoRA: Memory-efficient Fine-tuning

LoRA reduces the number of trainable parameters. QLoRA goes further by also shrinking the memory footprint of the frozen base model through quantization – compressing weight values from 16-bit or 32-bit floats to a 4-bit data type (NF4, NormalFloat4).

How QLoRA Works

A standard LoRA setup keeps the base model in 16-bit precision. For a 7B parameter model, that is still ~14GB of GPU memory just to store the weights. QLoRA solves this by:

  1. Loading the base model in 4-bit NF4 quantization – reducing the 7B model to ~4GB;
  2. Keeping the LoRA adapters in 16-bit (bfloat16) – they remain full precision for stable gradient updates;
  3. Dequantizing weights block by block on the fly during the forward and backward passes; the temporary 16-bit copies are discarded immediately, so only the 4-bit weights persist in GPU memory.

The adapters are the only parameters updated during training; the quantized base weights stay frozen throughout.
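The memory figures above follow from simple arithmetic. A back-of-the-envelope sketch (weights only; activations, optimizer state, and quantization constants are ignored):

```python
# Approximate weight memory for a 7B-parameter model.
# fp16/bf16 stores 2 bytes per weight; NF4 stores 4 bits (0.5 bytes).
params = 7e9

fp16_gb = params * 2 / 1024**3    # ~13 GB, matching the ~14GB figure above
nf4_gb = params * 0.5 / 1024**3   # ~3.3 GB, matching the ~4GB figure

print(f"fp16: ~{fp16_gb:.1f} GB, NF4: ~{nf4_gb:.1f} GB")
```

Real-world usage is somewhat higher than these numbers because of double-quantization constants, activations, and gradient buffers for the adapters.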

Implementation with bitsandbytes and PEFT

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 – best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True      # Nested quantization for extra memory savings
)

model_name = "bigscience/bloom-560m"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply LoRA adapters on top of the quantized base model
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

qlora_model = get_peft_model(model, lora_config)
qlora_model.print_trainable_parameters()

Run this locally if you have a CUDA GPU available – bitsandbytes 4-bit loading requires one. The print_trainable_parameters() output reports how many parameters are trainable (the LoRA adapters) out of the model's total, typically well under 1%.
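You can sanity-check the trainable-parameter count without a GPU. This sketch assumes bloom-560m's published architecture (hidden size 1024, 24 transformer blocks, a fused query_key_value projection of width 3 × 1024) together with the r=8 config above:

```python
# Each adapted layer gets two LoRA matrices: A (r x d_in) and B (d_out x r).
r = 8
hidden = 1024          # bloom-560m hidden size (assumed from the model card)
qkv_out = 3 * hidden   # fused query/key/value projection output width
n_layers = 24          # number of transformer blocks (assumed)

lora_params = n_layers * (r * hidden + qkv_out * r)
print(f"~{lora_params:,} trainable adapter parameters")  # ~786,432
```

Against roughly 560M base parameters, that is on the order of 0.1% of the model – the adapter footprint QLoRA actually trains.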
