Pre-training Large Language Models

Challenge: Pre-train a Miniature Language Model


Task

Build a complete pre-training pipeline from raw text to an evaluated language model. Use any plain-text dataset you have access to – a book from Project Gutenberg, a Wikipedia dump, or any domain-specific corpus.

Your pipeline should cover all of the following steps:

  1. Data pipeline: load and preprocess your corpus, split it into train and validation sets;
  2. Tokenizer: train a BPE tokenizer on your corpus using Hugging Face tokenizers, then encode the dataset into token sequences;
  3. Model: implement a causal language model in PyTorch. You may use the transformer you built in Course 2 or a simpler architecture;
  4. Pre-training loop: implement the CLM training loop with:
    • gradient accumulation to simulate a larger effective batch size;
    • a cosine learning rate schedule with linear warmup;
  5. Evaluation: compute perplexity on the validation set after each epoch and report how it changes over training.
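Step 2 could be sketched as follows with the Hugging Face `tokenizers` library; the tiny in-memory corpus and the `vocab_size` value are placeholders for your own data:

```python
# Minimal BPE tokenizer training with Hugging Face `tokenizers`.
# The tiny in-memory corpus below is a stand-in for your real text file(s).
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "pre-training a language model starts with a tokenizer",
]

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=500, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(corpus, trainer)

# Encode the whole corpus into one flat token-id sequence for CLM training.
ids = [i for line in corpus for i in tokenizer.encode(line).ids]
```

For a real corpus, replace `train_from_iterator` with `tokenizer.train(files=[...], trainer=trainer)` on your text files.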

Keep the model small enough to train on a CPU or a single consumer GPU – a 2-4 layer transformer with d_model of 128-256 is sufficient to observe learning.
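Steps 3–4 together might look like the sketch below. The random token batches stand in for your encoded corpus, and every hyperparameter (layer count, `warmup`, `total`, learning rate, `accumulation_steps`) is illustrative, not prescriptive:

```python
import math
import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    """2-layer causal transformer LM, small enough for CPU training."""
    def __init__(self, vocab_size, d_model=128, n_layers=2, n_heads=4, max_len=128):
        super().__init__()
        self.tok = nn.Embedding(vocab_size, d_model)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, x):                                   # x: (batch, seq)
        t = x.size(1)
        h = self.tok(x) + self.pos(torch.arange(t, device=x.device))
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(x.device)
        return self.head(self.blocks(h, mask=mask))

def lr_lambda(step, warmup=20, total=200):
    """Linear warmup to 1.0, then cosine decay to 0."""
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))

vocab_size, accumulation_steps = 500, 4
model = TinyCausalLM(vocab_size)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
loss_fn = nn.CrossEntropyLoss()

# Random token batches stand in for your tokenized corpus.
batches = [torch.randint(0, vocab_size, (8, 33)) for _ in range(8)]

for step, batch in enumerate(batches):
    logits = model(batch[:, :-1])                           # predict next token
    loss = loss_fn(logits.reshape(-1, vocab_size), batch[:, 1:].reshape(-1))
    (loss / accumulation_steps).backward()                  # accumulate gradients
    if (step + 1) % accumulation_steps == 0:                # effective batch = 8 * 4
        opt.step()
        sched.step()
        opt.zero_grad()
```

Dividing the loss by `accumulation_steps` before `backward()` keeps the accumulated gradient equal to the mean over the effective batch, so the optimizer step behaves as if you had used the larger batch directly.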

Once your model trains, experiment with the following:

  • Vocabulary size and its effect on tokenization quality;
  • Number of warmup steps and cosine decay length;
  • Effective batch size via different accumulation_steps values;
  • How perplexity on the validation set evolves compared to training loss.
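For the last comparison, note that validation perplexity is just the exponential of the mean next-token cross-entropy. A minimal sketch, assuming a model that maps `(batch, seq)` token ids to `(batch, seq, vocab)` logits:

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, val_batches):
    """exp(mean next-token cross-entropy) over all validation tokens."""
    total_loss, total_tokens = 0.0, 0
    for batch in val_batches:                   # batch: (B, T) token ids
        logits = model(batch[:, :-1])           # (B, T-1, vocab) predictions
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               batch[:, 1:].reshape(-1), reduction="sum")
        total_loss += loss.item()
        total_tokens += batch[:, 1:].numel()
    return math.exp(total_loss / total_tokens)
```

A useful sanity check: a model that outputs uniform logits has perplexity exactly equal to the vocabulary size, so your trained model should land well below that.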

Note any interesting observations – for example, at what epoch does perplexity start to plateau, and what does that tell you about the model's capacity relative to your dataset size?
