Fine-tuning and Adapting LLMs

QLoRA: Memory-efficient Fine-tuning


LoRA reduces the number of trainable parameters. QLoRA goes further by also reducing the memory footprint of the frozen base model through quantization – compressing weight values from 16-bit or 32-bit floats to a 4-bit data type (NF4).
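The savings are easy to sanity-check with back-of-envelope arithmetic (a sketch that counts weight storage only – activations and quantization constants add a little on top):

```python
# Rough memory footprint of a 7B-parameter model's weights.
params = 7e9                   # 7B parameters
fp16_gb = params * 2 / 1e9     # 2 bytes per 16-bit weight
nf4_gb = params * 0.5 / 1e9    # 4 bits = 0.5 bytes per weight

print(f"{fp16_gb:.1f} GB vs {nf4_gb:.1f} GB")  # 14.0 GB vs 3.5 GB
```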

How QLoRA Works

A standard LoRA setup keeps the base model in 16-bit precision. For a 7B parameter model, that is still ~14GB of GPU memory just to store the weights. QLoRA solves this by:

  1. Loading the base model in 4-bit NF4 quantization – reducing the 7B model to ~4GB;
  2. Keeping the LoRA adapters in 16-bit (bfloat16) – they remain full precision for stable gradient updates;
  3. Dequantizing weights on-the-fly to the compute dtype (bfloat16) during the forward and backward passes; the stored 4-bit weights are never modified.

The adapters are the only thing updated during training. The quantized base weights are always frozen.
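The quantize-once / dequantize-on-the-fly idea can be sketched in a few lines. This toy version uses plain absmax scaling to a signed 4-bit integer range rather than the real NF4 codebook, which maps values to quantiles of a normal distribution:

```python
# Toy 4-bit absmax block quantization (illustrative only – not real NF4).

def quantize_block(ws):
    # Scale so the largest magnitude maps to the signed 4-bit range [-7, 7].
    absmax = max(abs(w) for w in ws)
    scale = absmax / 7 if absmax else 1.0
    q = [round(w / scale) for w in ws]  # small integers, stored frozen
    return q, scale

def dequantize_block(q, scale):
    # Done on the fly for each forward pass; the stored q never changes.
    return [v * scale for v in q]

weights = [0.12, -0.40, 0.33, 0.05]
q, scale = quantize_block(weights)
approx = dequantize_block(q, scale)   # close to the originals, within scale/2
```

Real NF4 works block-wise over the weight tensor (one scale per block of, e.g., 64 values), and double quantization compresses those scales as well.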

Implementation with bitsandbytes and PEFT

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",          # NormalFloat4 – best for LLM weights
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True      # Nested quantization for extra memory savings
)

model_name = "bigscience/bloom-560m"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Apply LoRA adapters on top of the quantized base model
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["query_key_value"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

qlora_model = get_peft_model(model, lora_config)
qlora_model.print_trainable_parameters()

Run this locally if you have a CUDA GPU available. print_trainable_parameters() reports the number of trainable adapter parameters alongside the total parameter count and the percentage they represent – the frozen, quantized base weights do not count as trainable.
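The printed numbers can be estimated by hand. This sketch assumes bloom-560m's published configuration (24 transformer layers, hidden size 1024, a fused query_key_value projection of width 3 × 1024):

```python
# Back-of-envelope estimate of the trainable adapter parameter count.
hidden = 1024          # bloom-560m hidden size (assumed from its config)
qkv_out = 3 * hidden   # fused query_key_value projection width
n_layers = 24          # bloom-560m transformer layers (assumed)
r = 8                  # LoRA rank from lora_config above

# Each adapted layer adds an A matrix (hidden x r) and a B matrix (r x qkv_out).
per_layer = r * hidden + r * qkv_out
total_trainable = n_layers * per_layer

print(total_trainable)  # ~0.79M trainable params vs ~560M total (~0.14%)
```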

QLoRA vs. LoRA

Both methods train the same small set of adapter weights; they differ in how the frozen base model is stored. LoRA keeps the base weights in 16-bit precision, so a 7B model still needs ~14GB just for weights. QLoRA stores them in 4-bit NF4, cutting that to ~4GB at the cost of a small compute overhead from dequantizing on-the-fly. Fine-tuning quality is typically close to that of 16-bit LoRA.

どのように改善できますか?

フィードバックありがとうございます!

セクション 1.  6

AIに質問する

expand

AIに質問する

ChatGPT

何でも質問するか、提案された質問の1つを試してチャットを始めてください

セクション 1.  6
some-alt