Pre-training Large Language Models

Understanding BPE and WordPiece


Both Byte Pair Encoding (BPE) and WordPiece are subword tokenization algorithms that build a vocabulary by iteratively merging character sequences. They differ in how they decide which pairs to merge.

Byte Pair Encoding

BPE was originally a data compression algorithm. Applied to tokenization, it works as follows:

  1. Start with a vocabulary of individual characters;
  2. Count all adjacent symbol pairs in the training corpus;
  3. Merge the most frequent pair into a new token;
  4. Repeat until the target vocabulary size is reached.

For example, if "l o w" appears frequently, BPE might first merge l and o into lo, then lo and w into low. The final vocabulary contains the most common character sequences – from single characters up to full words.
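The merge loop described above can be sketched in a few lines. This is a toy trainer on a tiny hand-picked corpus, not a production implementation; the word frequencies are made up for illustration.

```python
from collections import Counter

def bpe_train(words, num_merges):
    """Toy BPE trainer. `words` maps a word to its corpus frequency;
    each word starts as a tuple of single characters."""
    vocab = {tuple(w): f for w, f in words.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Merge the most frequent pair into a single new symbol.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] = freq
        vocab = new_vocab
    return merges

# "low" dominates, so its character pairs are merged first.
merges = bpe_train({"low": 5, "lower": 2, "lowest": 2}, num_merges=2)
print(merges)  # first merges l+o, then lo+w
```

Running this on the example above produces the merges `l + o → lo` followed by `lo + w → low`, exactly the sequence described in the text.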

BPE is used in GPT-2, GPT-3, and LLaMA.

WordPiece

WordPiece follows the same iterative structure but selects merges differently. Instead of picking the most frequent pair, it picks the merge that maximizes the likelihood of the training data under the current vocabulary. This means it prefers merges that are statistically most useful for modeling the corpus, even if they are not the most frequent.
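One commonly cited form of WordPiece's criterion scores a candidate pair by its frequency normalized by the frequencies of its two parts. The counts below are invented purely for illustration:

```python
def wordpiece_score(pair_freq, first_freq, second_freq):
    # Commonly cited WordPiece merge score: how often the pair occurs
    # relative to how often its parts occur independently. A pair whose
    # parts rarely appear apart scores high even if the pair itself is
    # not the most frequent overall.
    return pair_freq / (first_freq * second_freq)

# Hypothetical counts: ("e", "r") occurs 100 times, but "e" and "r"
# are common everywhere; ("q", "u") occurs only 20 times, but "q"
# almost never appears without "u". WordPiece prefers the latter.
assert wordpiece_score(20, 25, 30) > wordpiece_score(100, 900, 800)
```

This is why WordPiece can pick a less frequent merge than BPE would: the score rewards pairs that are informative about each other, not merely common.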

WordPiece also prefixes subword tokens with ## to indicate they are continuations of a word – for example, "playing" might tokenize as ["play", "##ing"].
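At inference time, a WordPiece tokenizer applies greedy longest-match-first segmentation against its vocabulary, prefixing continuation pieces with `##`. A minimal sketch, with a tiny hand-built vocabulary for illustration:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first segmentation. `vocab` is a set of
    known tokens; pieces that continue a word carry the ## prefix."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest substring first, shrinking until a match is found.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no known subword fits at this position
        tokens.append(piece)
        start = end
    return tokens

vocab = {"play", "##ing", "##i", "##n", "##g"}
print(wordpiece_tokenize("playing", vocab))  # ['play', '##ing']
```

The greedy search takes the longest vocabulary entry at each step, so `"playing"` becomes `["play", "##ing"]` rather than falling back to single characters.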

WordPiece is used in BERT and its derivatives.

Comparison

The key difference is the merge criterion: BPE merges the most frequent pair, while WordPiece merges the pair that most increases the likelihood of the training data. Both handle rare and unknown words the same way – by decomposing them into smaller known subword units, down to individual characters if needed.

Question

Which of the following best describes a key difference between Byte Pair Encoding (BPE) and WordPiece tokenization?
