SentencePiece and Unigram Models
SentencePiece is a widely used subword tokenization framework that supports a fundamentally different approach to segmentation than deterministic algorithms like Byte Pair Encoding (BPE). Whereas BPE always applies the same merge rules in a fixed order, SentencePiece can train a unigram language model, which enables probabilistic token selection: every possible tokenization of an input is assigned a probability, namely the likelihood of that piece sequence under the learned unigram distribution. Rather than always producing a single segmentation, the tokenizer can therefore sample from multiple valid segmentations in proportion to their probabilities, or pick the single most likely one for deterministic inference.
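To make this concrete, here is a minimal, self-contained sketch. The piece probabilities are made up for illustration, not taken from any trained model; it enumerates every segmentation of a word over a toy vocabulary and scores each one as the product of its piece probabilities:

import math

# Hypothetical unigram probabilities for a toy vocabulary (illustrative only)
piece_prob = {
    "machine": 0.08, "ma": 0.05, "chine": 0.04,
    "m": 0.02, "a": 0.06, "c": 0.03, "h": 0.03,
    "i": 0.05, "n": 0.05, "e": 0.07,
}

def segmentations(word):
    # Recursively enumerate every split of `word` into known pieces
    if not word:
        yield []
        return
    for end in range(1, len(word) + 1):
        piece = word[:end]
        if piece in piece_prob:
            for rest in segmentations(word[end:]):
                yield [piece] + rest

# Probability of a segmentation = product of its piece probabilities
scored = [(seg, math.prod(piece_prob[p] for p in seg))
          for seg in segmentations("machine")]
for seg, prob in sorted(scored, key=lambda item: -item[1]):
    print(seg, f"{prob:.2e}")

The whole-word piece wins here. A real tokenizer finds the best path with the Viterbi algorithm rather than brute-force enumeration, and sampling draws a path in proportion to these scores.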
The core idea behind the unigram model is to optimize the vocabulary by maximizing the likelihood of the training corpus. Training starts from a large seed set of candidate subword tokens and iteratively prunes the tokens whose removal hurts the overall likelihood the least, until the target vocabulary size is reached, yielding a compact yet expressive vocabulary. This probabilistic framework lets the model capture more nuanced segmentation patterns and handle rare or unseen words gracefully, since it can weigh alternative segmentations instead of failing outright or dropping all the way to character-level splits.
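The pruning criterion can be sketched the same way. The toy example below again uses assumed probabilities, and brute-force rescoring rather than the EM re-estimation a real trainer performs; it measures how much the best-segmentation log-likelihood of a tiny corpus drops when each multi-character piece is removed. Single characters are never removed, so every word stays segmentable, mirroring SentencePiece's required character set:

import math

piece_prob = {
    "machine": 0.08, "ma": 0.05, "chine": 0.04,
    "m": 0.02, "a": 0.06, "c": 0.03, "h": 0.03,
    "i": 0.05, "n": 0.05, "e": 0.07,
}
corpus = ["machine", "machine", "chine"]

def best_logprob(word, probs):
    # Viterbi over prefixes: best[i] = max log-prob of segmenting word[:i]
    best = [0.0] + [-math.inf] * len(word)
    for i in range(1, len(word) + 1):
        for j in range(i):
            piece = word[j:i]
            if piece in probs and best[j] > -math.inf:
                best[i] = max(best[i], best[j] + math.log(probs[piece]))
    return best[-1]

baseline = sum(best_logprob(w, piece_prob) for w in corpus)
# Rank multi-character pieces by how much the corpus likelihood suffers
# when each one is dropped; the cheapest pieces are pruned first.
for piece in (p for p in piece_prob if len(p) > 1):
    reduced = {k: v for k, v in piece_prob.items() if k != piece}
    loss = baseline - sum(best_logprob(w, reduced) for w in corpus)
    print(f"removing {piece!r} costs {loss:.3f} nats of log-likelihood")

In this toy setup, removing "ma" costs nothing (the whole-word piece covers its uses), so it would be pruned first, while "chine" is expensive to lose. A real trainer also re-estimates the surviving pieces' probabilities with EM after each pruning round.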
import sentencepiece as spm

# In-memory corpus
corpus = [
    "machine learning is fun",
    "machine translation is challenging",
    "deep learning enables translation",
]

# Train a unigram model directly from memory
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(corpus),
    model_prefix="unigram_model",
    vocab_size=29,  # Increased to avoid required_chars error
    model_type="unigram"
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load("unigram_model.model")

# Encode a sample sentence
tokens = sp.encode("machine translation is powerful", out_type=str)
print("Tokenized:", tokens)

# Display the vocabulary
for i in range(sp.get_piece_size()):
    print(f"ID {i}: {sp.id_to_piece(i)}")
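Once the model above is trained, the probabilistic behavior described earlier can be observed directly. The snippet below reuses sp from the previous block; enable_sampling, alpha, and nbest_size are standard arguments of SentencePiece's encode API, though with only 29 pieces in the vocabulary the sampled variants may show little variety, so treat the output as illustrative:

# Draw several segmentations of the same sentence from the unigram model.
# alpha smooths the distribution; nbest_size=-1 samples over all hypotheses.
for _ in range(3):
    print(sp.encode("machine translation is powerful",
                    out_type=str,
                    enable_sampling=True,
                    alpha=0.1,
                    nbest_size=-1))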