Pre-training Large Language Models

Building a BPE Tokenizer with Hugging Face


When pre-training a language model on a domain-specific corpus, you often need a tokenizer trained on that same data. Hugging Face's tokenizers library lets you build a custom BPE tokenizer in three steps: initialize, train, save.

Step 1: Initialize

from tokenizers import Tokenizer, models, pre_tokenizers

# Creating a BPE tokenizer with whitespace pre-tokenization;
# unk_token ensures unseen characters map to <unk> instead of being silently dropped
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

pre_tokenizers.Whitespace() splits text into words and punctuation before BPE merges are applied – this makes words the base units for subword splitting and prevents merges from crossing word boundaries. Note that despite the name, it splits on word and punctuation boundaries, not just whitespace, so punctuation marks become separate pieces.
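You can inspect the pre-tokenizer in isolation with pre_tokenize_str, which returns each piece together with its character offsets. A quick sketch:

```python
from tokenizers import pre_tokenizers

# Whitespace() splits into words and punctuation, each with (start, end) offsets
pieces = pre_tokenizers.Whitespace().pre_tokenize_str("Hello, world!")
print(pieces)
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

Notice that the comma and exclamation mark come out as their own pieces, so BPE will never merge a word with adjacent punctuation.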

Step 2: Train

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(
    vocab_size=10000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)

# Training on a list of plain text files
files = ["corpus_part1.txt", "corpus_part2.txt"]
tokenizer.train(files, trainer)

min_frequency=2 means a symbol pair must occur at least twice in the corpus to be considered for a merge. special_tokens are reserved entries added to the vocabulary regardless of frequency.
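If you don't have files on disk, the library also accepts any iterator of strings via train_from_iterator. The sketch below trains on a small in-memory corpus (an assumption for illustration) and checks that the special tokens were reserved first, so they occupy the lowest ids:

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = BpeTrainer(
    vocab_size=200,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)

# Any iterator of strings works in place of files on disk
corpus = ["language models learn from text"] * 10
tokenizer.train_from_iterator(corpus, trainer)

# Special tokens are inserted before learned merges, so they get ids 0..3
print(tokenizer.token_to_id("<pad>"))  # 0
print(tokenizer.token_to_id("<unk>"))  # 1
```

This is handy for quick experiments, since you can iterate on trainer settings without writing corpus files.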

Step 3: Save and Load

# Saving vocabulary and merge rules to disk
tokenizer.save("bpe_tokenizer.json")

# Loading it back
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Encoding a sample string
encoding = tokenizer.encode("Language models learn from text.")
print(encoding.tokens)
print(encoding.ids)

Run this locally with a small .txt file as your corpus to see which merges the tokenizer learns and how it splits unseen text.
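The three steps above can be combined into one self-contained script. This sketch writes a throwaway corpus to a temporary file (an assumption, so it runs anywhere), trains, round-trips the tokenizer through disk, and encodes and decodes a sample sentence:

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers
from tokenizers.trainers import BpeTrainer

# Build a tiny throwaway corpus file so the script runs anywhere
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
    f.write("Language models learn from text.\n" * 20)
    corpus_path = f.name

# Initialize and train
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = BpeTrainer(
    vocab_size=200,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)
tokenizer.train([corpus_path], trainer)

# Round-trip through disk, then encode and decode
save_path = corpus_path.replace(".txt", ".json")
tokenizer.save(save_path)
tokenizer = Tokenizer.from_file(save_path)

encoding = tokenizer.encode("Language models learn from text.")
decoded = tokenizer.decode(encoding.ids)
print(encoding.tokens)
print(decoded)

# Clean up the temporary files
os.remove(corpus_path)
os.remove(save_path)
```

Because the corpus is so repetitive, most full words should appear as single tokens; on a realistic corpus you would see more subword splits for rare words.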
