SentencePiece and Unigram Models

SentencePiece is a widely used subword tokenization framework that introduces a fundamentally different approach to token segmentation compared to deterministic algorithms like Byte Pair Encoding (BPE). Unlike BPE, which always applies the same merge rules in a fixed order, SentencePiece leverages the unigram language model to enable probabilistic token selection. In this model, each possible tokenization of a given input is assigned a probability based on the likelihood of the sequence under the learned unigram distribution. This means that, rather than always producing a single segmentation, SentencePiece can sample from multiple valid segmentations according to their probabilities, or choose the most likely one for deterministic inference.
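To make this concrete, here is a minimal sketch of sampling at encode time. It assumes a trained unigram model saved as unigram_model.model (for example, the one produced by the training code below); the alpha and nbest_size values are illustrative choices, not required settings.

import sentencepiece as spm

# Assumes a unigram model trained beforehand, e.g. by the
# training code shown below in this chapter
sp = spm.SentencePieceProcessor()
sp.load("unigram_model.model")

text = "machine translation is fun"

# Deterministic encoding: the single most likely segmentation
print("Best:", sp.encode(text, out_type=str))

# Sampled encodings: draw alternative segmentations according to
# their probabilities; nbest_size=-1 samples over all candidates,
# and a smaller alpha flattens the distribution for more diversity
for _ in range(3):
    print("Sampled:", sp.encode(text, out_type=str, enable_sampling=True,
                                nbest_size=-1, alpha=0.1))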

The core idea behind the unigram model is to optimize a vocabulary by maximizing the likelihood of the training corpus. It starts with a large set of possible subword tokens, then iteratively removes tokens that contribute the least to the overall likelihood, resulting in a compact yet expressive vocabulary. This probabilistic framework allows the model to capture more nuanced language patterns and to better handle rare or unseen words by considering alternative segmentations rather than failing outright or reverting to character-level splits.
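As a rough illustration of the scoring that underlies this, the sketch below compares two candidate segmentations of the same word under a hand-made unigram distribution. All probability values here are invented for illustration; a real model estimates them from the corpus with an EM procedure during training.

import math

# Invented unigram probabilities, for illustration only
piece_probs = {
    "machine": 0.010,
    "mach": 0.002,
    "ine": 0.004,
}

def log_likelihood(segmentation):
    # A tokenization's probability is the product of its piece
    # probabilities, so its log-likelihood is the sum of logs
    return sum(math.log(piece_probs[piece]) for piece in segmentation)

# Two valid ways to segment the same word
candidates = [["machine"], ["mach", "ine"]]
for seg in candidates:
    print(seg, round(log_likelihood(seg), 3))

# The unigram model prefers the highest-likelihood segmentation
print("Most likely:", max(candidates, key=log_likelihood))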

import sentencepiece as spm

# In-memory corpus
corpus = [
    "machine learning is fun",
    "machine translation is challenging",
    "deep learning enables translation",
]

# Train a unigram model directly from memory
spm.SentencePieceTrainer.train(
    sentence_iterator=iter(corpus),
    model_prefix="unigram_model",
    vocab_size=29,  # Increased to avoid required_chars error
    model_type="unigram"
)

# Load the trained model
sp = spm.SentencePieceProcessor()
sp.load("unigram_model.model")

# Encode a sample sentence
tokens = sp.encode("machine translation is powerful", out_type=str)
print("Tokenized:", tokens)

# Display the vocabulary
for i in range(sp.get_piece_size()):
    print(f"ID {i}: {sp.id_to_piece(i)}")
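In the printed tokens, the ▁ character (U+2581) marks the beginning of a word: SentencePiece treats whitespace as part of the text itself, which lets a token sequence be decoded back to the original string without a separate detokenizer. The exact pieces and IDs depend on the training settings, so your output may differ slightly.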
