Pre-training Large Language Models

Challenge: Pre-train a Miniature Language Model


Task

Build a complete pre-training pipeline from raw text to an evaluated language model. Use any plain-text dataset you have access to: a book from Project Gutenberg, a Wikipedia dump, or any domain-specific corpus.

Your pipeline should cover all of the following steps:

  1. Data pipeline: load and preprocess your corpus, split it into train and validation sets;
  2. Tokenizer: train a BPE tokenizer on your corpus using Hugging Face tokenizers, then encode the dataset into token sequences;
  3. Model: implement a causal language model in PyTorch. You may use the transformer you built in Course 2 or a simpler architecture;
  4. Pre-training loop: implement the CLM training loop with:
    • gradient accumulation to simulate a larger effective batch size;
    • a cosine learning rate schedule with linear warmup;
  5. Evaluation: compute perplexity on the validation set after each epoch and report how it changes over training.
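Steps 1 and 2 might look like the sketch below, which uses Hugging Face `tokenizers` to train a BPE tokenizer. The toy `corpus`, the 90/10 split, and the vocabulary size of 500 are all placeholders — substitute your own dataset and settings.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Toy corpus standing in for your real dataset; replace with lines from your file.
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "language models learn the statistics of text",
] * 100

# Step 1: simple 90/10 train/validation split on raw lines.
split = int(0.9 * len(corpus))
train_lines, val_lines = corpus[:split], corpus[split:]

# Step 2: train a BPE tokenizer on the training split only,
# so the validation set stays unseen.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=500, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(train_lines, trainer=trainer)

# Encode both splits into token-id sequences.
train_ids = [tokenizer.encode(line).ids for line in train_lines]
val_ids = [tokenizer.encode(line).ids for line in val_lines]
```

Training the tokenizer only on the training split avoids leaking validation text into the vocabulary, which would slightly flatter your perplexity numbers.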

Keep the model small enough to train on a CPU or a single consumer GPU; a 2-4-layer transformer with a d_model of 128-256 is sufficient to observe learning.
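If you don't want to reuse your Course 2 transformer, step 3 can be sketched with PyTorch's built-in `nn.TransformerEncoder` plus a causal mask. The class name, sizes, and hyperparameters below are illustrative defaults matching the suggestion above, not a prescribed architecture.

```python
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """Minimal decoder-only LM: token + position embeddings, masked self-attention."""

    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, idx):
        T = idx.size(1)
        pos = torch.arange(T, device=idx.device)
        x = self.tok_emb(idx) + self.pos_emb(pos)
        # Upper-triangular -inf mask blocks attention to future positions.
        mask = torch.triu(
            torch.full((T, T), float("-inf"), device=idx.device), diagonal=1
        )
        x = self.blocks(x, mask=mask)
        return self.head(self.ln_f(x))  # logits: (batch, T, vocab_size)
```

A quick sanity check for causality: in eval mode, changing the last input token should not change the logits at any earlier position.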

Once your model trains, experiment with the following:

  • Vocabulary size and its effect on tokenization quality;
  • Number of warmup steps and cosine decay length;
  • Effective batch size via different accumulation_steps values;
  • How perplexity on the validation set evolves compared to training loss.
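The training-loop requirements (gradient accumulation, cosine decay with linear warmup) and the knobs listed above can be sketched as follows. The model and random batches are stand-ins so the loop runs end to end; in your pipeline, `y` would be the input sequence shifted left by one token, and the loop would iterate over your real DataLoader.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    """Linear warmup to the base LR, then cosine decay to zero."""
    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Stand-in model and data so the loop is runnable; swap in your own.
vocab_size, seq_len = 100, 16
model = nn.Sequential(nn.Embedding(vocab_size, 32), nn.Linear(32, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scheduler = cosine_with_warmup(optimizer, warmup_steps=10, total_steps=100)

accumulation_steps = 4  # effective batch = loader batch size * accumulation_steps
optimizer.zero_grad()
for step in range(40):  # replace with enumerate(train_loader)
    x = torch.randint(0, vocab_size, (8, seq_len))
    y = torch.randint(0, vocab_size, (8, seq_len))  # really: x shifted left by one
    logits = model(x)
    loss = F.cross_entropy(logits.reshape(-1, vocab_size), y.reshape(-1))
    # Scale the loss so gradients average over the accumulated micro-batches.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()  # one scheduler step per optimizer step, not per batch
        optimizer.zero_grad()
```

Note that the scheduler advances once per optimizer step, so `total_steps` should count optimizer steps, not micro-batches — an easy off-by-`accumulation_steps` mistake when you vary the accumulation setting.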

Note any interesting observations – for example, at what epoch does perplexity start to plateau, and what does that tell you about the model's capacity relative to your dataset size?
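For step 5, validation perplexity is the exponential of the mean per-token cross-entropy. A minimal sketch, assuming batches of `(inputs, shifted_targets)` pairs:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, batches):
    """Token-averaged perplexity: exp of the mean next-token cross-entropy."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for x, y in batches:  # y is x shifted left by one position
        logits = model(x)
        nll = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), y.reshape(-1), reduction="sum"
        )
        total_nll += nll.item()
        total_tokens += y.numel()
    return math.exp(total_nll / total_tokens)
```

A useful sanity check: a model that outputs uniform logits over a vocabulary of V tokens should score a perplexity of exactly V, so an untrained model's perplexity should start near your vocabulary size and fall from there.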


Section 1. Chapter 11
