Pre-training Large Language Models

Designing Data Pipelines for Large-Scale Text


Pre-training corpora can span hundreds of gigabytes or more. Loading them naively into RAM is not an option – you need a pipeline that streams, tokenizes, and batches data without becoming the bottleneck for training.

Loading with Streaming

Hugging Face datasets supports streaming mode, which reads data from disk (or the network) on demand rather than loading everything upfront:

from datasets import load_dataset

# Streaming mode – no full download required
dataset = load_dataset("text", data_files="corpus.txt", split="train", streaming=True)

for example in dataset.take(3):
    print(example)

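Use streaming when the dataset is too large to download and prepare up front, or when you want to start iterating immediately. For datasets that fit on local disk, you can drop streaming=True: the library then converts the data to an on-disk Arrow cache that is memory-mapped rather than loaded into RAM, and the results of later map calls are cached and reused across runs.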
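To make "on demand" concrete, here is a minimal standard-library sketch of the same idea: a generator that yields one record per line, so only the records actually consumed are produced. It is an illustration only and does not reflect how datasets is implemented internally; the function name stream_lines and the throwaway corpus file are hypothetical.

```python
import itertools
import os
import tempfile


def stream_lines(path):
    """Yield one {"text": ...} record per line without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            yield {"text": line.rstrip("\n")}


# Build a small throwaway corpus so the sketch runs anywhere
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\nfourth line\n")
    corpus_path = tmp.name

# Only three records are produced; the generator stops before the fourth line
lines = stream_lines(corpus_path)
first_three = list(itertools.islice(lines, 3))
lines.close()  # release the file handle before deleting the file
os.remove(corpus_path)

print(first_three)
```

The shape of each record, a dict with a "text" key, matches what the "text" loader yields above, which is why the tokenization function later indexes batch["text"].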

Tokenizing with map

Apply tokenization across the dataset using map. The batched=True option processes multiple examples per call, which significantly reduces overhead:

from transformers import GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Streaming datasets do not accept num_proc. On a non-streaming dataset,
# add num_proc=4 to parallelize tokenization across CPU cores.
tokenized = dataset.map(tokenize, batched=True)

For streaming datasets, map applies transformations lazily – each batch is tokenized as it is consumed.
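The lazy, batched behavior can be sketched in plain Python. The example below is a simplified model, not the library's implementation: lazy_batched_map, toy_tokenize, and counting_tokenize are hypothetical names, and the "tokenizer" just maps each word to its length. It shows that the function receives a dict of columns, and that no batch is processed until a result is requested.

```python
from itertools import islice


def lazy_batched_map(examples, fn, batch_size):
    """Apply fn to batches of rows lazily, yielding one output row at a time."""
    iterator = iter(examples)
    while True:
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        # Turn a list of rows into a dict of columns, as a batched function expects
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        # Keep the input columns next to the newly computed ones
        merged = {**batch, **fn(batch)}
        # Turn the dict of columns back into individual rows
        for i in range(len(chunk)):
            yield {key: column[i] for key, column in merged.items()}


def toy_tokenize(batch):
    # Hypothetical stand-in for a real tokenizer: word lengths act as "token IDs"
    return {"input_ids": [[len(w) for w in text.split()] for text in batch["text"]]}


calls = []  # records the size of every batch the function actually sees


def counting_tokenize(batch):
    calls.append(len(batch["text"]))
    return toy_tokenize(batch)


rows = ({"text": t} for t in ["a bb", "ccc", "dd e fff", "g", "hh ii"])
mapped = lazy_batched_map(rows, counting_tokenize, batch_size=2)

calls_before = list(calls)  # still empty: creating the pipeline did no work
first = next(mapped)        # triggers tokenization of the first batch only
calls_after_first = list(calls)
rest = list(mapped)         # consumes the remaining batches

print(first)
print(calls)
```

Calling the function once per batch instead of once per row is where the savings from batched=True come from in this model.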

Batching for Training

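Once tokenized, wrap the dataset in a DataLoader. Tokenized examples have different lengths, so a collator is needed to pad each batch to a common length before it can be stacked into tensors:

from torch.utils.data import DataLoader
from transformers import DataCollatorWithPadding

# GPT-2 has no pad token, so reuse EOS; the collator pads each batch to its longest sequence
tokenizer.pad_token = tokenizer.eos_token
collator = DataCollatorWithPadding(tokenizer)

# Drop the raw text column and leave shuffle unset: DataLoader rejects shuffle=True for streaming datasets
dataloader = DataLoader(tokenized.remove_columns(["text"]), batch_size=32, collate_fn=collator)

for batch in dataloader:
    input_ids = batch["input_ids"]
    # Pass to model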
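Because sequences in a batch rarely share a length, batching in practice involves padding. The following is a minimal pure-Python sketch of that step, assuming a pad ID of 0 and list-based examples; pad_batch is a hypothetical helper, not part of PyTorch or transformers.

```python
def pad_batch(examples, pad_id=0):
    """Pad variable-length token lists to the longest sequence in the batch."""
    max_len = max(len(example["input_ids"]) for example in examples)
    input_ids, attention_mask = [], []
    for example in examples:
        ids = list(example["input_ids"])
        n_pad = max_len - len(ids)
        # Real tokens are followed by padding up to the batch maximum
        input_ids.append(ids + [pad_id] * n_pad)
        # The mask marks real tokens with 1 and padding with 0
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_batch([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}, {"input_ids": [9, 10]}])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0], [9, 10, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0], [1, 1, 0]]
```

Padding to the longest sequence in each batch, rather than to a fixed global length, keeps batches of short examples small.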

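Set shuffle=True during training only for map-style (non-streaming) datasets. A streaming dataset cannot be shuffled globally, and DataLoader raises an error if shuffle=True is passed together with an iterable dataset. Shuffle it with a buffer instead, before building the DataLoader: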

dataset = dataset.shuffle(seed=42, buffer_size=10_000)
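A shuffle buffer trades randomness for memory: it holds only buffer_size examples at a time and emits a random one whenever a new example arrives. The sketch below shows one common way to implement this in plain Python; buffered_shuffle is a hypothetical name, and the library's internal implementation may differ.

```python
import random


def buffered_shuffle(examples, buffer_size, seed=None):
    """Approximately shuffle a stream while holding only buffer_size items."""
    rng = random.Random(seed)
    buffer = []
    for example in examples:
        if len(buffer) < buffer_size:
            # Fill the buffer before emitting anything
            buffer.append(example)
            continue
        # Emit a random buffered item and store the new one in its slot
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = example
    # The stream is exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(20), buffer_size=5, seed=42))
print(shuffled)
```

An item can never appear more than buffer_size - 1 positions earlier than its original position, so a small buffer only mixes nearby examples. If the corpus is ordered by source or topic, a larger buffer gives better mixing at the cost of more memory.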

Run a small version of this pipeline locally with a plain .txt file to verify your tokenization and batching work correctly before scaling up.
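A local run is most useful when it ends in explicit checks rather than a visual inspection. As a rough sketch, assuming batches are dicts of equal-length lists with "input_ids" and "attention_mask" keys, a hypothetical check_batch helper could look like this:

```python
def check_batch(batch, max_length):
    """Return a list of problems found in a padded batch; empty means none found."""
    problems = []
    lengths = {len(row) for row in batch["input_ids"]}
    if len(lengths) != 1:
        problems.append(f"rows have different lengths: {sorted(lengths)}")
    if any(length > max_length for length in lengths):
        problems.append(f"a row exceeds max_length={max_length}")
    for ids, mask in zip(batch["input_ids"], batch["attention_mask"]):
        if len(ids) != len(mask):
            problems.append("input_ids and attention_mask lengths differ")
            break
    return problems


good = {"input_ids": [[5, 6, 7], [8, 0, 0]], "attention_mask": [[1, 1, 1], [1, 0, 0]]}
bad = {"input_ids": [[5, 6, 7], [8]], "attention_mask": [[1, 1, 1], [1]]}

print(check_batch(good, max_length=512))  # []
print(check_batch(bad, max_length=2))
```

Running checks like these on the first few batches catches truncation and padding mistakes before they cost training time.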



Section 1. Chapter 4
