Byte Pair Encoding (BPE)
Byte Pair Encoding, or BPE, is a greedy, frequency-driven algorithm originally designed for data compression but now widely used for subword tokenization in natural language processing. BPE works by iteratively finding the most frequent pair of adjacent symbols in a text and merging them into a new symbol. This process reduces the overall number of tokens needed to represent the text, effectively compressing it and shortening sequence length. Each merge step adds a new symbol to the vocabulary, which allows the algorithm to flexibly adapt to the structure of the language data it processes. By always merging the most frequent pairs, BPE tends to capture common patterns, such as prefixes, suffixes, or frequently co-occurring character sequences, leading to efficient encoding of both frequent words and productive subword units.
Step-by-step BPE merge clarification
Suppose your dataset contains the word banana represented as a sequence of characters: b a n a n a.
- Count all pairs of adjacent symbols: ('b', 'a'), ('a', 'n'), ('n', 'a').
- Find the most frequent pair; here, ('a', 'n') and ('n', 'a') both appear twice, but suppose you pick ('a', 'n') to merge first.
- Merge ('a', 'n') everywhere it appears, so the sequence becomes: b an an a.
- Repeat: now the pairs are ('b', 'an'), ('an', 'an'), ('an', 'a').
- Continue merging the most frequent pair in each step, building up longer subword units and reducing the total number of tokens.
```python
# Basic BPE merge step on a toy dataset
from collections import Counter

def get_stats(tokens):
    """Count frequency of all adjacent symbol pairs in the token list."""
    pairs = Counter()
    for token in tokens:
        symbols = token.split()
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            pairs[pair] += 1
    return pairs

def merge_pair(tokens, pair_to_merge):
    """Merge the given pair wherever it appears in each token."""
    new_tokens = []
    bigram = ' '.join(pair_to_merge)
    replacement = ''.join(pair_to_merge)
    for token in tokens:
        new_token = token.replace(bigram, replacement)
        new_tokens.append(new_token)
    return new_tokens

# Example dataset: list of words split into characters
tokens = ['b a n a n a', 'a n a n a', 'b a n']
print("Initial tokens:", tokens)

# Step 1: Count pairs
pairs = get_stats(tokens)
print("Pair frequencies:", pairs)

# Step 2: Find the most frequent pair
most_freq = pairs.most_common(1)[0][0]
print("Most frequent pair to merge:", most_freq)

# Step 3: Merge the pair
tokens = merge_pair(tokens, most_freq)
print("Tokens after merge:", tokens)
```