Challenge: Pre-train a Miniature Language Model
Task
Build a complete pre-training pipeline from raw text to an evaluated language model. Use any plain-text dataset you have access to – a book from Project Gutenberg, a Wikipedia dump, or any domain-specific corpus.
Your pipeline should cover all of the following steps:
- Data pipeline: load and preprocess your corpus, split it into train and validation sets;
- Tokenizer: train a BPE tokenizer on your corpus using Hugging Face tokenizers, then encode the dataset into token sequences;
- Model: implement a causal language model in PyTorch. You may use the transformer you built in Course 2 or a simpler architecture;
- Pre-training loop: implement the CLM training loop with:
- gradient accumulation to simulate a larger effective batch size;
- a cosine learning rate schedule with linear warmup;
- Evaluation: compute perplexity on the validation set after each epoch and report how it changes over training.
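The first two steps above can be sketched as follows. This is a minimal example, assuming the Hugging Face `tokenizers` package is installed; the function name and the 90/10 split strategy are illustrative choices, not requirements.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace

def train_bpe_tokenizer(lines, vocab_size=4000):
    """Train a BPE tokenizer on an iterable of text lines (sketch)."""
    tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
    tokenizer.pre_tokenizer = Whitespace()
    trainer = BpeTrainer(vocab_size=vocab_size, special_tokens=["[UNK]"])
    tokenizer.train_from_iterator(lines, trainer)
    return tokenizer

def split_corpus(lines, val_fraction=0.1):
    """Hold out the last fraction of lines for validation (one possible split)."""
    cut = int(len(lines) * (1 - val_fraction))
    return lines[:cut], lines[cut:]
```

Once trained, `tokenizer.encode(text).ids` turns each line into a token-id sequence; concatenating these sequences and slicing fixed-length windows is a common way to build training batches.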
Keep the model small enough to train on a CPU or a single consumer GPU – a 2-4 layer transformer with d_model of 128-256 is sufficient to observe learning.
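At that scale, the model, CLM training loop with gradient accumulation, and perplexity evaluation might look like the sketch below. It is one possible implementation, not the reference solution: the class and function names are made up, and the learning-rate schedule is omitted here for brevity.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyCausalLM(nn.Module):
    """Small decoder-only transformer (sizes follow the 2-4 layer guidance)."""
    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=4, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.ln = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)

    def forward(self, ids):
        T = ids.size(1)
        pos = torch.arange(T, device=ids.device)
        x = self.tok_emb(ids) + self.pos_emb(pos)
        # Causal mask so each position attends only to earlier positions.
        mask = nn.Transformer.generate_square_subsequent_mask(T).to(ids.device)
        x = self.blocks(x, mask=mask)
        return self.head(self.ln(x))

def train_epoch(model, batches, optimizer, accumulation_steps=4):
    """One epoch of next-token training with gradient accumulation."""
    model.train()
    optimizer.zero_grad()
    for i, ids in enumerate(batches):
        logits = model(ids[:, :-1])
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               ids[:, 1:].reshape(-1))
        # Scale the loss so accumulated gradients match one larger batch.
        (loss / accumulation_steps).backward()
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

@torch.no_grad()
def perplexity(model, batches):
    """exp of the mean per-token cross-entropy over the validation set."""
    model.eval()
    total, count = 0.0, 0
    for ids in batches:
        logits = model(ids[:, :-1])
        total += F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                 ids[:, 1:].reshape(-1), reduction="sum").item()
        count += ids[:, 1:].numel()
    return math.exp(total / count)
```

Note the shift: the model predicts `ids[:, 1:]` from `ids[:, :-1]`, which is what makes this a causal LM objective.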
Once your model trains, experiment with the following:
- Vocabulary size and its effect on tokenization quality;
- Number of warmup steps and cosine decay length;
- Effective batch size via different accumulation_steps values;
- How perplexity on the validation set evolves compared to training loss.
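For the warmup and decay experiments, it helps to have the schedule as a standalone function you can sweep. A minimal sketch (the function name is illustrative; `total_steps` is assumed to count optimizer steps, i.e. batches divided by accumulation_steps):

```python
import math

def lr_at_step(step, max_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps        # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))
```

Before each optimizer step, set `param_group["lr"] = lr_at_step(...)` for every group in `optimizer.param_groups`; varying `warmup_steps` and `total_steps` then directly implements the experiments above.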
Note any interesting observations – for example, at what epoch does perplexity start to plateau, and what does that tell you about the model's capacity relative to your dataset size?