Tokenization and Information Theory

Why Out-of-Vocabulary Is Inevitable

Languages like English, German, and Turkish are considered open-vocabulary languages because new words can be created at any time. This property is known as morphological productivity: the ability to generate new word forms by combining roots, prefixes, suffixes, or even inventing entirely new terms. For instance, you might encounter a made-up word like hyperconnectivity or a rarely used technical term such as transmogrification. Since it is impossible to list every possible word that could ever exist, any fixed vocabulary will eventually encounter words it does not contain. This is the core reason why out-of-vocabulary (OOV) tokens are inevitable in practical tokenization systems.
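
To see how quickly this productivity outruns any fixed word list, here is a small sketch that combines a handful of toy morpheme inventories (the specific prefixes, roots, and suffixes are illustrative assumptions, and plain concatenation ignores real spelling rules) and counts the distinct forms they can produce.

from itertools import product

# Toy morpheme inventories (illustrative only, not a real lexicon)
prefixes = ["", "un", "re", "hyper", "de"]
roots = ["connect", "load", "code", "friend"]
suffixes = ["", "s", "ed", "ing", "ion", "ivity"]

# Every prefix-root-suffix combination is a potential word form
forms = {p + r + s for p, r, s in product(prefixes, roots, suffixes)}

print(len(forms))  # 120 distinct candidate forms from only 15 morphemes

Even this tiny inventory yields 120 candidate forms; with thousands of real roots and affixes, plus coinages, names, and borrowings, no enumeration can keep up.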

Even if you train on massive text corpora, new words, names, slang, and domain-specific terms will always appear. The infinite creativity of natural language means that your tokenizer will face words it has never seen before. As a result, handling OOV tokens becomes a fundamental challenge for any tokenization approach.

Definition

An OOV (out-of-vocabulary) token is any word or sequence not present in a tokenizer's known vocabulary. For example, unhappinesses might not be in a basic vocabulary, so it would be considered OOV.

Examples of OOV tokens include:

  • Newly coined words ("cryptojacking");
  • Misspellings ("recieve");
  • Domain-specific jargon ("bioinformatics");
  • Foreign words or names ("Schwarzenegger").
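
Before looking at subword methods, it helps to see what a purely word-level tokenizer does with such inputs. The minimal sketch below uses a tiny, made-up vocabulary and an assumed <unk> placeholder: any word outside the vocabulary collapses into that single marker.

# A tiny word-level vocabulary (hypothetical, for illustration only)
vocab = {"the", "cat", "sat", "on", "mat", "is", "happy"}
UNK = "<unk>"  # placeholder token used for any out-of-vocabulary word

def word_tokenize(text, vocab):
    # Replace every word not found in the vocabulary with the UNK marker
    return [word if word in vocab else UNK for word in text.lower().split()]

print(word_tokenize("The cat enjoys cryptojacking", vocab))
# ['the', 'cat', '<unk>', '<unk>']

Because every OOV word maps to the same <unk> token, whatever it meant is lost entirely; subword modeling, discussed next, softens this by reusing known word pieces.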

Subword modeling, which breaks words into smaller, more frequent parts, helps reduce OOVs by representing unseen words as combinations of known subwords. For example, unhappinesses might be split into un, happi, ness, es. However, this does not eliminate OOVs entirely, because truly novel character sequences, rare names, or code-like strings may still occur.

# Example: Subword tokenization of an unseen word using a simple vocabulary
vocab = {"un", "happi", "ness", "es", "bio", "info", "mat", "ics"}

def subword_tokenize(word, vocab):
    tokens = []
    i = 0
    while i < len(word):
        # Try to find the longest subword in vocab that matches the current position
        found = False
        for j in range(len(word), i, -1):
            sub = word[i:j]
            if sub in vocab:
                tokens.append(sub)
                i = j
                found = True
                break
        if not found:
            # If no subword found, treat single character as OOV
            tokens.append(word[i])
            i += 1
    return tokens

# Unseen word that is not in the vocabulary
word = "unhappinesses"
tokens = subword_tokenize(word, vocab)
print("Tokens:", tokens)
# Output: Tokens: ['un', 'happi', 'ness', 'es']
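
Reusing the same function and toy vocabulary, a word whose pieces are all unknown (the example word here is arbitrary) falls back to single-character tokens, which illustrates why subword schemes reduce OOV failures without removing them.

# A word that shares no subwords with the toy vocabulary above
print("Tokens:", subword_tokenize("xylophone", vocab))
# Output: Tokens: ['x', 'y', 'l', 'o', 'p', 'h', 'o', 'n', 'e']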

Which statement best describes why out-of-vocabulary (OOV) tokens are inevitable in tokenization systems?


Section 3. Chapter 1
