Summary  
This chapter covers how to build a scalable data pipeline by streaming large datasets on demand, applying lazy, batched, parallel transformations (like tokenization) via map, and organizing data into shuffled batches for efficient training.  

General domain of usage  
Natural language processing

Pre-training corpora can span hundreds of gigabytes or more. Loading them naively into RAM is not an option – you need a pipeline that streams, tokenizes, and batches data without becoming the bottleneck for training.



## Loading with Streaming

Hugging Face `datasets` supports streaming mode, which reads data from disk (or the network) on demand rather than loading everything upfront:

```python
from datasets import load_dataset

# Streaming mode – no full download required
dataset = load_dataset("text", data_files="corpus.txt", split="train", streaming=True)

for example in dataset.take(3):
    print(example)
```

Use streaming when the dataset is too large to download or to convert into a local cache up front. For smaller datasets, you can drop `streaming=True`; the data is then prepared once and cached on disk, so later runs load quickly.
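
To make "on demand" concrete, here is a minimal pure-Python sketch of the idea behind streaming: a generator that yields one line at a time, so stopping after three examples means the rest of the file is never read. It is only an illustration, not how `datasets` is implemented; `stream_lines` and the temporary corpus file are hypothetical names introduced for the example.

```python
import itertools
import os
import tempfile
from typing import Dict, Iterator


def stream_lines(path: str) -> Iterator[Dict[str, str]]:
    """Yield one {"text": line} example at a time without loading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:  # file objects iterate lazily, line by line
            yield {"text": line.rstrip("\n")}


# Tiny stand-in corpus written to a temporary file (assumption for the demo)
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("first line\nsecond line\nthird line\nfourth line\n")
    corpus_path = tmp.name

examples = stream_lines(corpus_path)
# Similar in spirit to dataset.take(3): stop after three examples
first_three = list(itertools.islice(examples, 3))
examples.close()  # release the file handle before deleting the file
os.remove(corpus_path)

for example in first_three:
    print(example)
```

The fourth line is never yielded, which is the property that lets a streaming pipeline start work immediately on a corpus of any size.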



## Tokenizing with `map`

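Apply tokenization across the dataset using `map`. The `batched=True` option passes many examples to the function per call (1,000 by default), which cuts per-call overhead and lets a fast tokenizer process the whole batch at once:

```python
from transformers import AutoTokenizer

# AutoTokenizer loads the fast (Rust-backed) GPT-2 tokenizer when available
tokenizer = AutoTokenizer.from_pretrained("gpt2")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

# Streaming datasets do not accept num_proc. Dropping the raw "text" column
# leaves only token fields (input_ids, attention_mask) for batching later.
tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])
```

For streaming datasets, `map` applies transformations lazily: each batch is tokenized as it is consumed. Parallelizing with `num_proc` is available only in non-streaming mode, where `dataset.map(tokenize, batched=True, num_proc=4)` spreads tokenization across four worker processes.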
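
The lazy, batched behavior can be pictured with a small generator. The sketch below is a simplified model, not the `datasets` implementation: `lazy_batched_map` and `toy_tokenize` are hypothetical names, the whitespace split stands in for a real tokenizer, and the output replaces the input columns instead of being merged with them.

```python
from typing import Any, Callable, Dict, Iterable, Iterator, List

Example = Dict[str, Any]
Batch = Dict[str, List[Any]]


def _apply(rows: List[Example], function: Callable[[Batch], Batch]) -> Iterator[Example]:
    # Rows -> dict of columns, the shape a batched function receives
    batch = {key: [row[key] for row in rows] for key in rows[0]}
    result = function(batch)
    n_out = len(next(iter(result.values())))
    for i in range(n_out):
        yield {key: values[i] for key, values in result.items()}


def lazy_batched_map(
    examples: Iterable[Example],
    function: Callable[[Batch], Batch],
    batch_size: int = 2,
) -> Iterator[Example]:
    """Group examples into batches, transform each batch, and yield single rows."""
    buffer: List[Example] = []
    for example in examples:
        buffer.append(example)
        if len(buffer) == batch_size:
            yield from _apply(buffer, function)
            buffer = []
    if buffer:  # final, possibly smaller batch
        yield from _apply(buffer, function)


calls: List[int] = []  # records batch sizes so we can see when work happens


def toy_tokenize(batch: Batch) -> Batch:
    calls.append(len(batch["text"]))
    return {"tokens": [text.split() for text in batch["text"]]}


rows = [{"text": "a b"}, {"text": "c"}, {"text": "d e f"}]
mapped = lazy_batched_map(rows, toy_tokenize, batch_size=2)
print(calls)          # still empty: building the pipeline runs nothing
first = next(mapped)  # pulling one example triggers the first batch only
print(first, calls)
```

Nothing is tokenized when the pipeline is defined; work happens one batch at a time as the consumer asks for examples.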



## Batching for Training

Once tokenized, wrap the dataset in a `DataLoader` for efficient batching:

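```python
from torch.utils.data import DataLoader
from transformers import DataCollatorForLanguageModeling

# GPT-2 defines no pad token; reuse EOS so examples of different lengths can be padded
tokenizer.pad_token = tokenizer.eos_token
# Expects examples that hold only token fields (input_ids, attention_mask)
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# No shuffle=True: PyTorch rejects it for streaming (iterable) datasets
dataloader = DataLoader(tokenized, batch_size=32, collate_fn=collator)

for batch in dataloader:
    input_ids = batch["input_ids"]
    # Pass to model
```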
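
Tokenized examples differ in length, so a batch can only be stacked into a rectangular tensor after padding. As a rough illustration of what a padding collate function does, here is a minimal pure-Python sketch. `pad_collate` and `pad_id=0` are hypothetical choices made for the example and are not part of PyTorch or `transformers`.

```python
from typing import Dict, List


def pad_collate(
    examples: List[Dict[str, List[int]]], pad_id: int = 0
) -> Dict[str, List[List[int]]]:
    """Pad every input_ids list to the length of the longest one in the batch."""
    longest = max(len(example["input_ids"]) for example in examples)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for example in examples:
        ids = example["input_ids"]
        n_pad = longest - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)
        # 1 marks real tokens, 0 marks padding the model should ignore
        attention_mask.append([1] * len(ids) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


batch = pad_collate([{"input_ids": [5, 6, 7]}, {"input_ids": [8]}])
print(batch["input_ids"])       # [[5, 6, 7], [8, 0, 0]]
print(batch["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

Padding to the longest example in each batch, instead of always padding to `max_length`, keeps batches of short texts small.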

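Pass `shuffle=True` to the `DataLoader` only for non-streaming (map-style) datasets; PyTorch rejects it for iterable datasets. For streaming datasets, where full shuffling is not possible, shuffle approximately with a buffer on the dataset itself, before wrapping it in the `DataLoader`: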

```python
dataset = dataset.shuffle(seed=42, buffer_size=10_000)
```
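
A shuffle buffer trades randomness for memory: only `buffer_size` examples are held at once, and each output is drawn at random from that window. The sketch below is a conceptual model of the technique, not the library's code; `buffered_shuffle` is a hypothetical name.

```python
import random
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def buffered_shuffle(items: Iterable[T], buffer_size: int, seed: int) -> Iterator[T]:
    """Approximately shuffle a stream while holding at most buffer_size items."""
    rng = random.Random(seed)
    buffer: List[T] = []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)  # fill phase: nothing is emitted yet
            continue
        # Buffer full: emit a random resident and put the new item in its slot
        index = rng.randrange(buffer_size)
        yield buffer[index]
        buffer[index] = item
    # Stream exhausted: emit whatever is left in random order
    rng.shuffle(buffer)
    yield from buffer


shuffled = list(buffered_shuffle(range(10), buffer_size=4, seed=42))
print(shuffled)
```

An example can move only a limited distance from its original position, so a small buffer on a corpus sorted by source or topic still produces correlated batches. A larger `buffer_size` gives better mixing at the cost of more memory.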

Run a small version of this pipeline locally with a plain `.txt` file to verify your tokenization and batching work correctly before scaling up.
