
SentencePiece and Unigram Models

SentencePiece is a widely used subword tokenization framework that introduces a fundamentally different approach to token segmentation compared to deterministic algorithms like Byte Pair Encoding (BPE). Unlike BPE, which always applies the same merge rules in a fixed order, SentencePiece (in its default unigram mode) uses a unigram language model to enable probabilistic token selection. Under this model, each possible tokenization of an input is assigned a probability: the likelihood of that token sequence under the learned unigram distribution. Rather than always producing a single segmentation, SentencePiece can therefore sample from multiple valid segmentations in proportion to their probabilities, or pick the most likely one for deterministic inference.
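
To make the scoring rule concrete, here is a minimal, self-contained sketch (it does not use SentencePiece itself, and the vocabulary and probabilities are invented for illustration): every segmentation of a word into in-vocabulary pieces is scored by the product of the pieces' unigram probabilities, computed below in log space.

import math

# Hypothetical unigram vocabulary and probabilities, invented for illustration
unigram_probs = {
    "machine": 0.05, "mach": 0.01, "ine": 0.02,
    "m": 0.04, "a": 0.06, "c": 0.05, "h": 0.04,
    "i": 0.06, "n": 0.05, "e": 0.07,
}

def segmentations(word):
    # Recursively enumerate every split of word into in-vocabulary pieces
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in unigram_probs:
            for rest in segmentations(word[end:]):
                yield [piece] + rest

# Score each segmentation by the sum of log unigram probabilities,
# printing the most likely segmentation first
for seg in sorted(segmentations("machine"),
                  key=lambda s: -sum(math.log(unigram_probs[p]) for p in s)):
    log_prob = sum(math.log(unigram_probs[p]) for p in seg)
    print(seg, f"log-probability: {log_prob:.2f}")

In practice SentencePiece finds the most likely segmentation with dynamic programming (the Viterbi algorithm) rather than by brute-force enumeration; only the scoring rule is shown here.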

The core idea behind the unigram model is to optimize a vocabulary by maximizing the likelihood of the training corpus. It starts with a large set of possible subword tokens, then iteratively removes tokens that contribute the least to the overall likelihood, resulting in a compact yet expressive vocabulary. This probabilistic framework allows the model to capture more nuanced language patterns and to better handle rare or unseen words by considering alternative segmentations rather than failing outright or reverting to character-level splits.
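
The pruning step can be sketched in the same toy setting. The snippet below is illustrative rather than SentencePiece's actual implementation (the corpus and probabilities are invented): it scores a tiny corpus with a Viterbi-style dynamic program and measures how much the corpus log-likelihood drops when each multi-character token is removed. The real trainer additionally re-estimates token probabilities with EM between pruning rounds.

import math

corpus = ["machine", "machinery"]  # toy corpus, invented for illustration

# Hypothetical vocabulary; single characters are kept so any input stays representable
vocab = {
    "machine": 0.05, "mach": 0.02, "ine": 0.02, "ery": 0.02,
    "m": 0.05, "a": 0.06, "c": 0.05, "h": 0.05, "i": 0.06,
    "n": 0.05, "e": 0.07, "r": 0.04, "y": 0.03,
}

def best_log_prob(word, probs):
    # Viterbi-style DP: log-probability of the best segmentation of word
    best = [0.0] + [-math.inf] * len(word)
    for end in range(1, len(word) + 1):
        for start in range(end):
            piece = word[start:end]
            if piece in probs:
                best[end] = max(best[end], best[start] + math.log(probs[piece]))
    return best[-1]

def corpus_log_likelihood(probs):
    return sum(best_log_prob(w, probs) for w in corpus)

baseline = corpus_log_likelihood(vocab)
for piece in [p for p in vocab if len(p) > 1]:
    reduced = {p: q for p, q in vocab.items() if p != piece}
    loss = baseline - corpus_log_likelihood(reduced)
    print(f"removing {piece!r} costs {loss:.2f} in log-likelihood")

Tokens whose removal costs the least are pruned first; repeating this loop until the target vocabulary size is reached yields the compact vocabulary described above.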

import sentencepiece as spm

# In-memory corpus
corpus = [
    "machine learning is fun",
    "machine translation is challenging",
    "deep learning enables translation",
]

# Train a unigram model directly from memory
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(corpus),
    model_prefix="unigram_model",
    vocab_size=29,  # Increased to avoid required_chars error
    model_type="unigram"
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load("unigram_model.model")

# Encode a sample sentence
tokens = sp.encode("machine translation is powerful", out_type=str)
print("Tokenized:", tokens)

# Display the vocabulary
for i in range(sp.get_piece_size()):
    print(f"ID {i}: {sp.id_to_piece(i)}")
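
Assuming the model trained above, the probabilistic behavior described earlier can be observed directly. With sampling enabled, encode draws a segmentation from the learned distribution instead of always returning the single best one, so the pieces printed below may differ between runs; the sampling parameters shown are part of SentencePiece's documented Python API.

# Deterministic encoding: always the single most likely segmentation
print(sp.encode("machine translation is powerful", out_type=str))

# Sampled encodings: each call may return a different valid segmentation
for _ in range(3):
    print(sp.encode(
        "machine translation is powerful",
        out_type=str,
        enable_sampling=True,  # sample instead of taking the argmax
        alpha=0.1,             # smoothing; smaller values flatten the distribution
        nbest_size=-1,         # sample over all segmentations, not a fixed n-best list
    ))

This sampling mode underlies subword regularization, where a model is exposed to several segmentations of the same text during training.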


