Understanding BPE and WordPiece
Both Byte Pair Encoding (BPE) and WordPiece are subword tokenization algorithms that build a vocabulary by iteratively merging character sequences. They differ in how they decide which pairs to merge.
Byte Pair Encoding
BPE was originally a data compression algorithm. Applied to tokenization, it works as follows:
- Start with a vocabulary of individual characters;
- Count all adjacent symbol pairs in the training corpus;
- Merge the most frequent pair into a new token;
- Repeat until the target vocabulary size is reached.
For example, if "l o w" appears frequently, BPE might first merge l and o into lo, then lo and w into low. The final vocabulary contains the most common character sequences – from single characters up to full words.
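The training loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the toy corpus, the word representation (a tuple of symbols mapped to its frequency), and the fixed number of merge steps are all choices made for this example.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

# Toy corpus: "low" appears 5 times, "lower" twice.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
merges = []
for _ in range(2):  # two merge steps for illustration
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    merges.append(best)
    corpus = merge_pair(corpus, best)

# merges is now [("l", "o"), ("lo", "w")], producing the token "low"
```

Note that ties (here, ("l", "o") and ("o", "w") both occur 7 times) are broken by insertion order in this sketch; real implementations fix a deterministic tie-breaking rule.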
BPE is used in GPT-2, GPT-3, and LLaMA.
WordPiece
WordPiece follows the same iterative structure but selects merges differently. Instead of picking the most frequent pair, it picks the merge that maximizes the likelihood of the training data under the current vocabulary. This means it prefers merges that are statistically most useful for modeling the corpus, even if they are not the most frequent.
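WordPiece's likelihood criterion is commonly approximated by scoring each candidate pair as count(ab) / (count(a) × count(b)): a pair scores highly when it occurs often *relative to* how common its parts are. The sketch below uses a made-up corpus chosen so that the likelihood-based choice differs from the frequency-based one.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score candidate merges by count(ab) / (count(a) * count(b)),
    a standard approximation of WordPiece's likelihood gain."""
    pair_counts, unit_counts = Counter(), Counter()
    for word, freq in corpus.items():
        for sym in word:
            unit_counts[sym] += freq
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += freq
    return {p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]])
            for p in pair_counts}

# Toy corpus (hypothetical): "hug" x10, "hat" x5.
corpus = {("h", "u", "g"): 10, ("h", "a", "t"): 5}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
# best is ("a", "t"): it appears only 5 times, less often than
# ("h", "u") or ("u", "g"), but "a" and "t" are rare on their own,
# so merging them explains the corpus best per occurrence.
```

Raw frequency would have picked ("h", "u") or ("u", "g") here, which is exactly the difference from BPE.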
WordPiece also prefixes subword tokens with ## to indicate they are continuations of a word – for example, "playing" might tokenize as ["play", "##ing"].
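At inference time, a trained WordPiece vocabulary is applied with greedy longest-match-first segmentation, prefixing non-initial pieces with ##. A minimal sketch, assuming a tiny hand-picked vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation (BERT-style).
    Non-initial subwords carry the ## continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate and retry
        if cur is None:
            return ["[UNK]"]  # no known subword covers this position
        tokens.append(cur)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration.
vocab = {"play", "##ing", "##ed", "##s"}
wordpiece_tokenize("playing", vocab)  # → ["play", "##ing"]
```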
WordPiece is used in BERT and its derivatives.
Comparison
Both handle rare and unknown words the same way – by decomposing them into smaller known subword units, down to individual characters if needed. The practical difference lies entirely in the merge criterion: BPE picks the most frequent pair, while WordPiece picks the pair that most improves the likelihood of the training data.
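To see the shared fallback behavior, here is a sketch of how BPE encodes an unseen word by replaying its learned merges (the merge list is the one a toy run on "low"/"lower" might produce); anything the merges do not cover simply stays as single characters:

```python
def bpe_encode(word, merges):
    """Apply learned merges in order. Unmatched characters remain
    as-is, so any unknown word decomposes into known units."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical learned merges from a toy training run.
merges = [("l", "o"), ("lo", "w")]
bpe_encode("lowest", merges)  # → ["low", "e", "s", "t"]
```

The unseen word "lowest" is never left as an unknown token: the known prefix "low" is reused, and the rest falls back to single characters.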