SentencePiece and Unigram Models
SentencePiece is a widely used subword tokenization framework that supports a fundamentally different approach to segmentation than deterministic algorithms like Byte Pair Encoding (BPE). Whereas BPE always applies the same merge rules in a fixed order, SentencePiece can train a unigram language model, which enables probabilistic token selection: every possible tokenization of an input is assigned a probability, namely the likelihood of that piece sequence under the learned unigram distribution. Rather than always producing a single segmentation, the tokenizer can therefore sample from multiple valid segmentations in proportion to their probabilities, or pick the single most likely one for deterministic inference.
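To make this concrete, here is a minimal, self-contained sketch. The piece probabilities are made up for illustration, not taken from any trained model; it enumerates every segmentation of a word over a toy vocabulary and scores each one as the product of its piece probabilities:

import math

# Hypothetical unigram probabilities for a toy vocabulary (illustrative only)
piece_prob = {
    "machine": 0.08, "ma": 0.05, "chine": 0.04,
    "m": 0.02, "a": 0.06, "c": 0.03, "h": 0.03,
    "i": 0.05, "n": 0.05, "e": 0.07,
}

def segmentations(word):
    # Recursively enumerate every split of `word` into known pieces
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in piece_prob:
            for rest in segmentations(word[end:]):
                yield [piece] + rest

# Probability of a segmentation = product of its piece probabilities
scored = [(seg, math.prod(piece_prob[p] for p in seg))
          for seg in segmentations("machine")]
for seg, prob in sorted(scored, key=lambda item: -item[1]):
    print(seg, f"{prob:.2e}")

The whole-word piece wins here. A real tokenizer finds the best path with the Viterbi algorithm rather than brute-force enumeration, and sampling draws a path in proportion to these scores.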
The core idea behind the unigram model is to optimize the vocabulary by maximizing the likelihood of the training corpus. Training starts from a large seed set of candidate subword tokens and iteratively prunes the tokens whose removal hurts the overall likelihood the least, until the target vocabulary size is reached, yielding a compact yet expressive vocabulary. This probabilistic framework lets the model capture more nuanced segmentation patterns and handle rare or unseen words gracefully, since it can weigh alternative segmentations instead of failing outright or dropping all the way to character-level splits.
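The pruning criterion can be sketched the same way. The toy example below again uses assumed probabilities, and brute-force rescoring rather than the EM re-estimation a real trainer performs; it measures how much the best-segmentation log-likelihood of a tiny corpus drops when each multi-character piece is removed. Single characters are never removed, so every word stays segmentable, mirroring SentencePiece's required character set:

import math

piece_prob = {
    "machine": 0.08, "ma": 0.05, "chine": 0.04,
    "m": 0.02, "a": 0.06, "c": 0.03, "h": 0.03,
    "i": 0.05, "n": 0.05, "e": 0.07,
}
corpus = ["machine", "machine", "chine"]

def best_logprob(word, probs):
    # Viterbi over prefixes: best[i] = max log-prob of segmenting word[:i]
    best = [0.0] + [-math.inf] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in probs and best[j] > -math.inf:
                best[i] = max(best[i], best[j] + math.log(probs[piece]))
    return best[-1]

baseline = sum(best_logprob(w, piece_prob) for w in corpus)
# Rank multi-character pieces by how much the corpus likelihood suffers
# when each one is dropped; the cheapest pieces are pruned first.
for piece in (p for p in piece_prob if len(p) > 1):
    reduced = {k: v for k, v in piece_prob.items() if k != piece}
    loss = baseline - sum(best_logprob(w, reduced) for w in corpus)
    print(f"removing {piece!r} costs {loss:.3f} nats of log-likelihood")

In this toy setup, removing "ma" costs nothing (the whole-word piece covers its uses), so it would be pruned first, while "chine" is expensive to lose. A real trainer also re-estimates the surviving pieces' probabilities with EM after each pruning round.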
import sentencepiece as spm

# In-memory corpus
corpus = [
    "machine learning is fun",
    "machine translation is challenging",
    "deep learning enables translation",
]

# Train a unigram model directly from memory
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(corpus),
    model_prefix="unigram_model",
    vocab_size=29,  # Increased to avoid required_chars error
    model_type="unigram"
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load("unigram_model.model")

# Encode a sample sentence
tokens = sp.encode("machine translation is powerful", out_type=str)
print("Tokenized:", tokens)

# Display the vocabulary
for i in range(sp.get_piece_size()):
    print(f"ID {i}: {sp.id_to_piece(i)}")
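Once the model above is trained, the probabilistic behavior described earlier can be observed directly. The snippet below reuses sp from the previous block; enable_sampling, alpha, and nbest_size are standard arguments of SentencePiece's encode API, though with only 29 pieces in the vocabulary the sampled variants may show little variety, so treat the output as illustrative:

# Draw several segmentations of the same sentence from the unigram model.
# alpha smooths the distribution; nbest_size=-1 samples over all hypotheses.
for _ in range(3):
    print(sp.encode("machine translation is powerful",
                    out_type=str,
                    enable_sampling=True,
                    alpha=0.1,
                    nbest_size=-1))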