Pre-training Large Language Models

Understanding BPE and WordPiece



Both Byte Pair Encoding (BPE) and WordPiece are subword tokenization algorithms that build a vocabulary by iteratively merging character sequences. They differ in how they decide which pairs to merge.

Byte Pair Encoding

BPE was originally a data compression algorithm. Applied to tokenization, it works as follows:

  1. Start with a vocabulary of individual characters.
  2. Count all adjacent symbol pairs in the training corpus.
  3. Merge the most frequent pair into a new token.
  4. Repeat until the target vocabulary size is reached.

For example, if "l o w" appears frequently, BPE might first merge l and o into lo, then lo and w into low. The final vocabulary contains the most common character sequences – from single characters up to full words.
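The merge loop above can be sketched in a few lines of Python. This is a minimal illustration, not a production tokenizer: real implementations pre-tokenize the corpus and track word-boundary markers, both omitted here.

```python
from collections import Counter

def train_bpe(corpus, num_merges):
    """Minimal BPE training sketch: corpus is a list of words."""
    # Step 1: start from individual characters (words kept separate,
    # so merges never cross word boundaries).
    words = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        # Step 2: count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Step 3: merge the most frequent pair into a new token.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = {}
        for symbols, freq in words.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            merged[tuple(out)] = freq
        words = merged
    # Step 4: the loop repeats until the merge budget (vocabulary size)
    # is exhausted.
    return merges
```

On a toy corpus where "low" dominates, the first two merges reproduce the example above: l + o, then lo + w.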

BPE is used in GPT-2, GPT-3, and LLaMA.

WordPiece

WordPiece follows the same iterative structure but selects merges differently. Instead of picking the most frequent pair, it picks the merge that maximizes the likelihood of the training data under the current vocabulary. This means it prefers merges that are statistically most useful for modeling the corpus, even if they are not the most frequent.
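One common way to formalize this criterion (for example, in Hugging Face's description of WordPiece training) is to score each candidate pair by its frequency divided by the product of its parts' frequencies, which approximates the likelihood gain of the merge. A minimal sketch, assuming the corpus is already represented as symbol tuples with counts:

```python
from collections import Counter

def wordpiece_scores(words):
    """Score candidate merges the WordPiece way: a pair's frequency
    normalized by the frequencies of its parts. A high score means the
    two symbols co-occur more often than their individual popularity
    would predict."""
    pair_freq, sym_freq = Counter(), Counter()
    for symbols, freq in words.items():
        for s in symbols:
            sym_freq[s] += freq
        for a, b in zip(symbols, symbols[1:]):
            pair_freq[(a, b)] += freq
    return {p: f / (sym_freq[p[0]] * sym_freq[p[1]])
            for p, f in pair_freq.items()}
```

On the toy corpus {low ×5, lower ×2}, the pair (e, r) scores 2/(2·2) = 0.5 while the far more frequent (l, o) scores only 7/(7·7) ≈ 0.14, so WordPiece would merge er first even though lo occurs more often.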

WordPiece also prefixes subword tokens with ## to indicate they are continuations of a word – for example, "playing" might tokenize as ["play", "##ing"].
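At inference time, BERT's WordPiece segments each word greedily, always taking the longest vocabulary entry that matches from the current position. A simplified sketch (the vocabulary here is a plain set, and unknown-word handling is reduced to a single [UNK] token):

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation. Subwords that continue
    a word carry the ## prefix; if no segmentation exists, the whole
    word maps to [UNK]."""
    tokens, start = [], 0
    while start < len(word):
        end, match = len(word), None
        # Shrink the candidate substring until it appears in the vocab.
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # mark word-internal continuation
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return ["[UNK]"]
        tokens.append(match)
        start = end
    return tokens
```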

WordPiece is used in BERT and its derivatives.

Comparison

Both handle rare and unknown words the same basic way – by decomposing them into smaller known subword units, down to individual characters if needed. (BERT's WordPiece additionally maps any word it cannot segment at all to a special [UNK] token.)
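For BPE, this fallback comes for free: encoding a new word simply replays the learned merges in order, and any part not covered by a merge remains as single characters. A minimal sketch (the merge list is assumed to come from training, as in the "low" example above):

```python
def apply_bpe(word, merges):
    """Encode a word by replaying learned merges in order. Rare words
    decompose into known subwords, or single characters if no merge
    applies."""
    symbols = list(word)
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)  # apply this merge rule
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

With the merges (l, o) and (lo, w), the unseen word "lowest" becomes ["low", "e", "s", "t"]: the known prefix is reused and the rest falls back to characters.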
