Building a BPE Tokenizer with Hugging Face
When pre-training a language model on a domain-specific corpus, you often need a tokenizer trained on that same data. Hugging Face's tokenizers library lets you build a custom BPE tokenizer in three steps: initialize, train, save.
Step 1: Initialize
from tokenizers import Tokenizer, models, pre_tokenizers
# Creating a BPE tokenizer with whitespace pre-tokenization.
# Passing unk_token lets the model map unseen symbols to <unk>
# instead of silently dropping them.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
pre_tokenizers.Whitespace() splits text on whitespace and punctuation boundaries before BPE merges are applied, so words and punctuation marks become the base units for subword splitting; no merge can cross those boundaries.
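You can inspect the pre-tokenization step on its own with pre_tokenize_str, which returns each piece together with its character offsets:

```python
from tokenizers import pre_tokenizers

pre_tok = pre_tokenizers.Whitespace()
# Punctuation is separated from words, not just whitespace.
print(pre_tok.pre_tokenize_str("Hello, world!"))
# [('Hello', (0, 5)), (',', (5, 6)), ('world', (7, 12)), ('!', (12, 13))]
```

This makes it easy to check how your text will be chunked before committing to a training run.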
Step 2: Train
from tokenizers.trainers import BpeTrainer
trainer = BpeTrainer(
    vocab_size=10000,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"]
)
# Training on a list of plain text files
files = ["corpus_part1.txt", "corpus_part2.txt"]
tokenizer.train(files, trainer)
min_frequency=2 means a pair must appear at least twice to be considered for merging. special_tokens are reserved entries added to the vocabulary regardless of frequency.
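If your corpus is already in memory (for example, streamed from a dataset) rather than in .txt files, the same trainer works with train_from_iterator, which accepts any iterable of strings. The sample sentences and the small vocab_size below are placeholders:

```python
from tokenizers import Tokenizer, models, pre_tokenizers
from tokenizers.trainers import BpeTrainer

tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = BpeTrainer(
    vocab_size=500,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"],
)

# Any iterable of strings works: a list, a generator, a file object.
corpus = [
    "language models learn from text",
    "language models predict the next token",
]
tokenizer.train_from_iterator(corpus, trainer)
print(tokenizer.get_vocab_size())
```

Note that the special tokens occupy the first vocabulary ids in the order given, so "&lt;pad&gt;" gets id 0.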
Step 3: Save and Load
# Saving vocabulary and merge rules to disk
tokenizer.save("bpe_tokenizer.json")
# Loading it back
tokenizer = Tokenizer.from_file("bpe_tokenizer.json")
# Encoding a sample string
encoding = tokenizer.encode("Language models learn from text.")
print(encoding.tokens)
print(encoding.ids)
Run this locally with a small .txt file as your corpus to see which merges the tokenizer learns and how it splits unseen text.
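Putting the three steps together, here is a self-contained sketch you can run without preparing a corpus first; the corpus lines, file paths, and hyperparameters are placeholders for your own data:

```python
import os
import tempfile

from tokenizers import Tokenizer, models, pre_tokenizers
from tokenizers.trainers import BpeTrainer

# Write a tiny throwaway corpus so the example is self-contained.
tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, "corpus.txt")
with open(corpus_path, "w") as f:
    f.write("language models learn from text\n" * 50)
    f.write("models learn merges from frequent pairs\n" * 50)

# Steps 1 and 2: initialize and train.
tokenizer = Tokenizer(models.BPE(unk_token="<unk>"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = BpeTrainer(
    vocab_size=200,
    min_frequency=2,
    special_tokens=["<pad>", "<unk>", "<s>", "</s>"],
)
tokenizer.train([corpus_path], trainer)

# Step 3: save, reload, and encode unseen text.
json_path = os.path.join(tmp_dir, "bpe_demo.json")
tokenizer.save(json_path)
reloaded = Tokenizer.from_file(json_path)
encoding = reloaded.encode("models learn from unseen text")
print(encoding.tokens)
```

Because the repeated words dominate the pair counts, the learned merges tend to reassemble them into whole-word tokens, while rarer words in new text stay split into smaller subwords.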