Building a BPE Tokenizer with Hugging Face
When pre-training a language model on a domain-specific corpus, you often need a tokenizer trained on that same data. Hugging Face's tokenizers library lets you build a custom BPE tokenizer in three steps: initialize, train, save.
Step 1: Initialize
from tokenizers import Tokenizer, models, pre_tokenizers
# Creating a BPE tokenizer with whitespace pre-tokenization
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
pre_tokenizers.Whitespace() splits text into word and punctuation pieces (matching the pattern \w+|[^\w\s]+) before any BPE merges are applied. This keeps words as the base units for subword splitting: merges never cross word boundaries.
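To see what the pre-tokenizer does without training anything, the same splitting behavior can be sketched in plain Python with the pattern above (an approximation for illustration, not the library's implementation):

```python
import re

# Approximates pre_tokenizers.Whitespace(): runs of word characters,
# or runs of punctuation; whitespace itself is dropped.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")

def whitespace_pre_tokenize(text):
    """Split text into word and punctuation pieces."""
    return WHITESPACE_PATTERN.findall(text)

print(whitespace_pre_tokenize("Language models learn, fast!"))
# ['Language', 'models', 'learn', ',', 'fast', '!']
```

Note that the comma and exclamation mark become separate pieces, so BPE will never merge punctuation into a neighboring word.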
Step 2: Train
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(
vocab_size=10000,
min_frequency=2,
special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)
# Training on a list of plain text files
files = ["corpus_part1.txt", "corpus_part2.txt"]
tokenizer.train(files, trainer)
min_frequency=2 means a pair must appear at least twice to be considered for merging. special_tokens are reserved entries added to the vocabulary regardless of frequency.
Step 3: Save and Load
# Saving vocabulary and merge rules to disk
tokenizer.save("bpe_tokenizer.json")
# Loading it back
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
# Encoding a sample string
encoding = tokenizer.encode("Language models learn from text.")
print(encoding.tokens)
print(encoding.ids)
Run this locally with a small .txt file as your corpus to see which merges the tokenizer learns and how it splits unseen text.
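The learned merges are just an ordered list of pair rules, and encoding an unseen word applies them greedily in priority order. A minimal sketch of that application step (the merge rules here are hypothetical, for illustration only):

```python
def apply_bpe(word, merges):
    """Apply learned merges to a single word.
    `merges` is an ordered list of symbol pairs, highest priority first."""
    symbols = list(word)  # start from individual characters
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # fuse the pair in place
            else:
                i += 1
    return symbols

# Hypothetical merges a trained tokenizer might have learned
merges = [("l", "o"), ("lo", "w")]
print(apply_bpe("lowest", merges))  # ['low', 'e', 's', 't']
```

This is why a word the tokenizer never saw still encodes cleanly: any characters left unmerged simply remain as single-character tokens.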