Identifying the Most Frequent Words in Text

Regexp Tokenizer

RegexpTokenizer is a class in NLTK designed for tokenizing text data using regular expressions. These expressions are powerful patterns capable of matching specific sequences in text, such as words or punctuation marks.

The RegexpTokenizer is particularly useful when you need customized tokenization, for example, keeping only alphanumeric tokens while discarding punctuation entirely.
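As a quick illustration of this flexibility, the sketch below (using a hypothetical sample sentence) shows how different patterns passed to RegexpTokenizer produce different token sets from the same input:

```python
from nltk.tokenize import RegexpTokenizer

text = "Don't hesitate: email us at info@example.com today!"

# \w+ matches runs of letters, digits, and underscores,
# so punctuation is dropped and contractions are split
word_tokenizer = RegexpTokenizer(r"\w+")
print(word_tokenizer.tokenize(text))
# ['Don', 't', 'hesitate', 'email', 'us', 'at', 'info', 'example', 'com', 'today']

# A different pattern keeps only capitalized words
cap_tokenizer = RegexpTokenizer(r"[A-Z]\w*")
print(cap_tokenizer.tokenize(text))
# ['Don']
```

Because the pattern fully defines what counts as a token, you can adapt it to the needs of a given corpus rather than relying on a fixed tokenization scheme.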

Task


  1. Import the RegexpTokenizer for tokenization based on a regular expression pattern from NLTK.
  2. Create a tokenizer that splits text into words using a specific regular expression.
  3. Tokenize the lemmatized words to create a list of words.

Solution

# Import RegexpTokenizer for tokenization based on a regular expression pattern
from nltk.tokenize import RegexpTokenizer

# Create a tokenizer that splits text into words using a regular expression
tokenizer = RegexpTokenizer(r"\w+")

# Tokenize the lemmatized words, creating a list of words
story_tokenized = tokenizer.tokenize(" ".join(lemmatized_words))

# Display the tokenized story
story_tokenized
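The solution above assumes a `lemmatized_words` list produced in an earlier step. As a self-contained sketch (with a hypothetical word list standing in for the real lemmatized output), the same pipeline can be run end to end and extended with `collections.Counter` to identify the most frequent words, which is the goal of this chapter:

```python
from collections import Counter
from nltk.tokenize import RegexpTokenizer

# Hypothetical output of an earlier lemmatization step (assumption for illustration)
lemmatized_words = ["the", "fox", "jump", "over", "the", "lazy", "dog"]

# Tokenize the joined words, keeping only alphanumeric tokens
tokenizer = RegexpTokenizer(r"\w+")
story_tokenized = tokenizer.tokenize(" ".join(lemmatized_words))
print(story_tokenized)
# ['the', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']

# Count token frequencies to find the most common words
word_counts = Counter(story_tokenized)
print(word_counts.most_common(2))
# [('the', 2), ('fox', 1)]
```

Joining and re-tokenizing guarantees the final list is punctuation-free even if the lemmatizer left stray characters in some entries.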
