Pre-training Large Language Models

Building a BPE Tokenizer with Hugging Face


When pre-training a language model on a domain-specific corpus, you often need a tokenizer trained on that same data. Hugging Face's tokenizers library lets you build a custom BPE tokenizer in three steps: initialize, train, save.

Step 1: Initialize

from tokenizers import Tokenizer, models, pre_tokenizers

# Creating a BPE tokenizer with whitespace pre-tokenization
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

pre_tokenizers.Whitespace() splits text on whitespace and isolates punctuation before any BPE merges are applied – this keeps words (and individual punctuation marks) as the base units within which subword merging happens.
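As a rough stdlib sketch of this behavior (the actual pre-tokenizer is implemented in Rust), Whitespace splits text according to the pattern \w+|[^\w\s]+ – word characters grouped together, punctuation kept separate:

```python
import re

# Rough stdlib sketch of what pre_tokenizers.Whitespace() does:
# split into runs of word characters or runs of punctuation,
# discarding the whitespace between them.
def whitespace_pretokenize(text):
    return re.findall(r"\w+|[^\w\s]+", text)

print(whitespace_pretokenize("Language models learn from text."))
# → ['Language', 'models', 'learn', 'from', 'text', '.']
```

Note how the final period becomes its own piece instead of sticking to "text" – BPE merges will never cross these boundaries.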

Step 2: Train

from tokenizers.trainers import BpeTrainer

trainer = BpeTrainer(
    vocab_size=10000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)

# Training on a list of plain text files
files = ["corpus_part1.txt", "corpus_part2.txt"]
tokenizer.train(files, trainer)

min_frequency=2 means a pair must appear at least twice to be considered for merging. special_tokens are reserved entries added to the vocabulary regardless of frequency.
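One training step of the merge loop can be sketched in plain Python (a toy illustration with hypothetical word counts, not the library's actual implementation): count adjacent symbol pairs across the corpus, then merge the most frequent pair if it clears min_frequency.

```python
from collections import Counter

# Count how often each adjacent symbol pair occurs, weighted by word frequency.
def count_pairs(words):
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

# Rewrite every word, fusing each occurrence of the chosen pair into one symbol.
def merge_pair(words, pair):
    a, b = pair
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == (a, b):
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Hypothetical corpus: each word starts as a tuple of characters with a count.
words = {tuple("lower"): 2, tuple("lowest"): 3, tuple("low"): 5}
min_frequency = 2

pairs = count_pairs(words)
best, freq = max(pairs.items(), key=lambda kv: kv[1])
if freq >= min_frequency:
    words = merge_pair(words, best)

print(best, freq)  # ('l', 'o') 10
print(list(words))
```

Real training repeats this step until vocab_size entries exist; pairs below min_frequency are simply never merged, which keeps rare noise out of the vocabulary.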

Step 3: Save and Load

# Saving vocabulary and merge rules to disk
tokenizer.save("bpe_tokenizer.json")

# Loading it back
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")

# Encoding a sample string
encoding = tokenizer.encode("Language models learn from text.")
print(encoding.tokens)
print(encoding.ids)

Run this locally with a small .txt file as your corpus to see which merges the tokenizer learns and how it splits unseen text.
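Until then, here is a toy sketch of how learned merges get applied to an unseen word at encoding time. The merge list below is hypothetical – the real tokenizer loads its merges from bpe_tokenizer.json:

```python
# Hypothetical merge rules, listed in the order they were learned.
merges = [("l", "o"), ("lo", "w"), ("e", "r")]

def apply_merges(word, merges):
    # Start from individual characters and apply each merge rule in order,
    # fusing every adjacent occurrence of the pair.
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]
            else:
                i += 1
    return symbols

print(apply_merges("lowered", merges))
# → ['low', 'er', 'e', 'd']
```

"lowered" never appeared in the corpus, yet it decomposes into the familiar subwords "low" and "er" plus leftover characters – this is how BPE handles out-of-vocabulary words without ever emitting <unk> for text made of known characters.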


Which of the following is NOT a required step when building a BPE tokenizer with Hugging Face's tokenizers library?


Section 1. Chapter 3
