Tokenization and Information Theory

Byte Pair Encoding (BPE)

Byte Pair Encoding, or BPE, is a greedy, frequency-driven algorithm originally designed for data compression but now widely used for subword tokenization in natural language processing. BPE works by iteratively finding the most frequent pair of adjacent symbols in a text and merging them into a new symbol. This process reduces the overall number of tokens needed to represent the text, effectively compressing it and shortening sequence length. Each merge step adds a new symbol to the vocabulary, which allows the algorithm to flexibly adapt to the structure of the language data it processes. By always merging the most frequent pairs, BPE tends to capture common patterns, such as prefixes, suffixes, or frequently co-occurring character sequences, leading to efficient encoding of both frequent words and productive subword units.
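Sketched end to end, the loop just described looks like the following. This is an illustrative toy, not a production tokenizer: the two-word corpus and the number of merges are made up for the example, and ties between equally frequent pairs are broken by whichever pair was counted first.

```python
from collections import Counter

def merge(symbols, pair):
    """Merge every adjacent occurrence of `pair` in a symbol list."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])  # fuse the pair into one symbol
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn `num_merges` merge rules from a toy corpus of words."""
    corpus = [list(word) for word in words]   # start from single characters
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the whole corpus.
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        # Greedily pick the most frequent pair; first-seen pair wins ties.
        best = max(pairs, key=pairs.get)
        corpus = [merge(symbols, best) for symbols in corpus]
        merges.append(best)
    return merges, corpus

merges, corpus = learn_bpe(['banana', 'bandana'], num_merges=3)
print(merges)   # each learned rule becomes a new vocabulary symbol
print(corpus)
```

Operating on lists of symbols (rather than doing string replacement) keeps each merge confined to exact symbol boundaries, which matters once symbols grow longer than one character.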

Step-by-step BPE merge walkthrough

Suppose your dataset contains the word banana represented as a sequence of characters: b a n a n a.

  • Count all pairs of adjacent symbols: ('b', 'a'), ('a', 'n'), ('n', 'a');
  • Find the most frequent pair; here, ('a', 'n') and ('n', 'a') both appear twice, but suppose you pick ('a', 'n') to merge first;
  • Merge ('a', 'n') everywhere it appears, so the sequence becomes: b an an a;
  • Repeat: now the pairs are ('b', 'an'), ('an', 'an'), ('an', 'a');
  • Continue merging the most frequent pair in each step, building up longer subword units and reducing the total number of tokens.
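The walkthrough can be replayed in a few lines of Python. Ties between equally frequent pairs (such as ('a', 'n') and ('n', 'a'), which both appear twice at the start) are broken here by whichever pair was counted first, matching the choice of ('a', 'n') above:

```python
from collections import Counter

seq = 'b a n a n a'
steps = [seq]
for _ in range(3):
    symbols = seq.split()
    # Count adjacent pairs; Counter preserves first-seen order for ties.
    pairs = Counter(zip(symbols, symbols[1:]))
    best = max(pairs, key=pairs.get)
    # Replace the space-separated pair with its merged form everywhere.
    seq = seq.replace(' '.join(best), ''.join(best))
    steps.append(seq)
    print(f"merge {best}: {seq}")
```

Each iteration prints the chosen pair and the resulting sequence, so you can watch the subword units grow merge by merge.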
```python
# Basic BPE merge step on a toy dataset
import re
from collections import Counter

def get_stats(tokens):
    """Count frequency of all adjacent symbol pairs in the token list."""
    pairs = Counter()
    for token in tokens:
        symbols = token.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += 1
    return pairs

def merge_pair(tokens, pair_to_merge):
    """Merge every occurrence of the given pair in all tokens."""
    bigram = re.escape(' '.join(pair_to_merge))
    # Match the pair only at symbol boundaries, so that e.g. merging
    # ('a', 'n') cannot fire inside a longer symbol such as 'ya'.
    pattern = re.compile(r'(?<!\S)' + bigram + r'(?!\S)')
    replacement = ''.join(pair_to_merge)
    return [pattern.sub(replacement, token) for token in tokens]

# Example dataset: list of words split into characters
tokens = ['b a n a n a', 'a n a n a', 'b a n']
print("Initial tokens:", tokens)

# Step 1: count all adjacent pairs
pairs = get_stats(tokens)
print("Pair frequencies:", pairs)

# Step 2: find the most frequent pair
most_freq = pairs.most_common(1)[0][0]
print("Most frequent pair to merge:", most_freq)

# Step 3: merge that pair everywhere it occurs
tokens = merge_pair(tokens, most_freq)
print("Tokens after merge:", tokens)
```
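Once merge rules have been learned, they are applied to new text in the same order they were learned. A minimal sketch of that encoding step (the `merges` list below is a hypothetical result of an earlier training run, not output from the code above):

```python
def apply_merges(word, merges):
    """Segment an unseen word using an ordered list of learned merges."""
    symbols = list(word)                      # start from single characters
    for pair in merges:                       # apply rules in learned order
        merged, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols

# Hypothetical merge rules, e.g. learned from banana-like training words
merges = [('a', 'n'), ('an', 'a')]
print(apply_merges('bandana', merges))
```

Because the same ordered rules are replayed at encoding time, any word, even one never seen during training, can be segmented into known subword units (falling back to single characters where no rule applies).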

Section 2. Chapter 1
