Understanding BPE and WordPiece
Both Byte Pair Encoding (BPE) and WordPiece are subword tokenization algorithms that build a vocabulary by iteratively merging character sequences. They differ in how they decide which pairs to merge.
Byte Pair Encoding
BPE was originally a data compression algorithm. Applied to tokenization, it works as follows:
- Start with a vocabulary of individual characters;
- Count all adjacent symbol pairs in the training corpus;
- Merge the most frequent pair into a new token;
- Repeat until the target vocabulary size is reached.
For example, if "l o w" appears frequently, BPE might first merge l and o into lo, then lo and w into low. The final vocabulary contains the most common character sequences – from single characters up to full words.
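The training loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: the toy corpus, the word representation (a tuple of symbols mapped to its frequency), and the fixed number of merge steps are all choices made for this example.

```python
from collections import Counter

def get_pair_counts(corpus):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with the merged symbol."""
    merged = pair[0] + pair[1]
    new_corpus = {}
    for word, freq in corpus.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        new_corpus[tuple(out)] = freq
    return new_corpus

# Toy corpus: "low" appears 5 times, "lower" twice.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2}
merges = []
for _ in range(2):  # two merge steps for illustration
    pairs = get_pair_counts(corpus)
    best = max(pairs, key=pairs.get)  # most frequent pair wins
    merges.append(best)
    corpus = merge_pair(corpus, best)

# merges is now [("l", "o"), ("lo", "w")], producing the token "low"
```

Note that ties (here, ("l", "o") and ("o", "w") both occur 7 times) are broken by insertion order in this sketch; real implementations fix a deterministic tie-breaking rule.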
BPE is used in GPT-2, GPT-3, and LLaMA.
WordPiece
WordPiece follows the same iterative structure but selects merges differently. Instead of picking the most frequent pair, it picks the merge that maximizes the likelihood of the training data under the current vocabulary. This means it prefers merges that are statistically most useful for modeling the corpus, even if they are not the most frequent.
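WordPiece's likelihood criterion is commonly approximated by scoring each candidate pair as count(ab) / (count(a) × count(b)): a pair scores highly when it occurs often *relative to* how common its parts are. The sketch below uses a made-up corpus chosen so that the likelihood-based choice differs from the frequency-based one.

```python
from collections import Counter

def wordpiece_scores(corpus):
    """Score candidate merges by count(ab) / (count(a) * count(b)),
    a standard approximation of WordPiece's likelihood gain."""
    pair_counts, unit_counts = Counter(), Counter()
    for word, freq in corpus.items():
        for sym in word:
            unit_counts[sym] += freq
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += freq
    return {p: pair_counts[p] / (unit_counts[p[0]] * unit_counts[p[1]])
            for p in pair_counts}

# Toy corpus (hypothetical): "hug" x10, "hat" x5.
corpus = {("h", "u", "g"): 10, ("h", "a", "t"): 5}
scores = wordpiece_scores(corpus)
best = max(scores, key=scores.get)
# best is ("a", "t"): it appears only 5 times, less often than
# ("h", "u") or ("u", "g"), but "a" and "t" are rare on their own,
# so merging them explains the corpus best per occurrence.
```

Raw frequency would have picked ("h", "u") or ("u", "g") here, which is exactly the difference from BPE.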
WordPiece also prefixes subword tokens with ## to indicate they are continuations of a word – for example, "playing" might tokenize as ["play", "##ing"].
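At inference time, a trained WordPiece vocabulary is applied with greedy longest-match-first segmentation, prefixing non-initial pieces with ##. A minimal sketch, assuming a tiny hand-picked vocabulary:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation (BERT-style).
    Non-initial subwords carry the ## continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, cur = len(word), None
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # continuation of a word
            if piece in vocab:
                cur = piece
                break
            end -= 1  # shrink the candidate and retry
        if cur is None:
            return ["[UNK]"]  # no known subword covers this position
        tokens.append(cur)
        start = end
    return tokens

# Hypothetical toy vocabulary for illustration.
vocab = {"play", "##ing", "##ed", "##s"}
wordpiece_tokenize("playing", vocab)  # → ["play", "##ing"]
```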
WordPiece is used in BERT and its derivatives.
Comparison
Both handle rare and unknown words the same way – by decomposing them into smaller known subword units, down to individual characters if needed. The practical difference lies entirely in the merge criterion: BPE picks the most frequent pair, while WordPiece picks the pair that most improves the likelihood of the training data.
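To see the shared fallback behavior, here is a sketch of how BPE encodes an unseen word by replaying its learned merges (the merge list is the one a toy run on "low"/"lower" might produce); anything the merges do not cover simply stays as single characters:

```python
def bpe_encode(word, merges):
    """Apply learned merges in order. Unmatched characters remain
    as-is, so any unknown word decomposes into known units."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols

# Hypothetical learned merges from a toy training run.
merges = [("l", "o"), ("lo", "w")]
bpe_encode("lowest", merges)  # → ["low", "e", "s", "t"]
```

The unseen word "lowest" is never left as an unknown token: the known prefix "low" is reused, and the rest falls back to single characters.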