Extracting Text Meaning using TF-IDF

Tokenize Words

This phase prepares the text for later NLP tasks by splitting sentences into their constituent words and removing common words that carry little semantic value. It involves three steps:

Preprocessing Sentences

Each sentence is first preprocessed to:

  • Remove non-alphabetic characters: the call re.sub(r'[^a-zA-Z\s]', '', sentence) strips every character except ASCII letters and whitespace, so digits and punctuation (including apostrophes and hyphens) are removed and only word characters remain;
  • Convert to lowercase: sentence.lower() lowercases each sentence, so the same word is not treated as two different words because of capitalization (for example, "The" and "the").
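
The two preprocessing steps can be sketched on a single sentence as follows. This is a minimal illustration: the helper name clean_sentence and the sample sentence are assumptions made for this example, not part of the lesson's code.

```python
import re

def clean_sentence(sentence: str) -> str:
    """Keep only ASCII letters and whitespace, then lowercase the result."""
    # Drop every character that is neither a letter nor whitespace
    letters_only = re.sub(r'[^a-zA-Z\s]', '', sentence)
    # Lowercase so that "The" and "the" become the same token later on
    return letters_only.lower()

sample = "It's 9 AM, don't panic!"
cleaned = clean_sentence(sample)
print(repr(cleaned))
# 'its  am dont panic'
```

Note that removing the digit leaves two consecutive spaces, and that contractions lose their apostrophe ("don't" becomes "dont"). The extra whitespace is harmless because tokenization splits on it.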

Word Tokenization

After preprocessing, the sentences are ready to be split into individual words.

We apply NLTK's word_tokenize to each cleaned sentence. This function splits a sentence into a list of words, moving the analysis from the sentence level to the word level, which the later TF-IDF steps operate on.
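
As a rough sketch of this step, the snippet below tokenizes two already cleaned sentences. The wrapper tokenize and the sample sentences are hypothetical names introduced for illustration. The fallback to str.split() is also an assumption of this sketch: it is used only when NLTK or its tokenizer data is not available, and it behaves the same as word_tokenize for simple lowercase text like the samples here.

```python
def tokenize(sentence):
    """Split a cleaned sentence into a list of word tokens."""
    try:
        from nltk.tokenize import word_tokenize
        return word_tokenize(sentence)
    except (ImportError, LookupError):
        # NLTK is not installed, or its tokenizer data has not been downloaded:
        # fall back to plain whitespace splitting
        return sentence.split()

# Hypothetical cleaned sentences (letters and whitespace only, lowercase)
cleaned_sentences = [
    "the cat sat on the mat",
    "rare words get higher weights",
]

tokenized = [tokenize(sentence) for sentence in cleaned_sentences]
print(tokenized[0])
# ['the', 'cat', 'sat', 'on', 'the', 'mat']
```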

Stopword Removal

Removing stopwords is a standard part of text preprocessing:

  • Defining stopwords: stopwords (common words like "the", "is", "in") are loaded from NLTK's 'stopwords' corpus using stopwords.words("english"). These words are grammatically necessary but carry little meaning on their own and add noise to the analysis;
  • Filtering stopwords: each tokenized sentence is filtered to exclude stopwords. This keeps only the words that contribute to the semantic content of the text and reduces the number of tokens that later steps have to process.
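
The filtering step can be illustrated without downloading any corpus. In the sketch below, the small stop_words set is a hand-written stand-in for stopwords.words("english"), and the token lists are made-up sample data.

```python
# Hypothetical mini stopword list; the lesson loads the full English list from NLTK
stop_words = {"the", "is", "in", "on", "a"}

# Sample tokenized sentences (assumed output of the tokenization step)
tokenized_sentences = [
    ["the", "cat", "is", "in", "the", "garden"],
    ["a", "bird", "sat", "on", "the", "fence"],
]

# Keep only the tokens that are not stopwords, sentence by sentence
filtered_sentences = [
    [word for word in words if word not in stop_words]
    for words in tokenized_sentences
]

print(filtered_sentences)
# [['cat', 'garden'], ['bird', 'sat', 'fence']]
```

Storing the stopwords in a set rather than a list keeps each membership check fast, which matters once the full stopword list and a larger corpus are involved.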

Task

  1. Download the necessary NLTK resources and import the functions for working with stopwords and tokenization.
  2. Tokenize each cleaned sentence into individual words.
  3. Load a set of English stopwords from NLTK's corpus.
  4. Filter out stopwords from each tokenized sentence.

Solution

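# Importing necessary modules and downloading NLTK resources
import re
import nltk
nltk.download('stopwords')
nltk.download('punkt')  # tokenizer data for word_tokenize; recent NLTK releases may need 'punkt_tab' instead
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize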

# Pre-processing each sentence: keep letters and whitespace only, then lowercase
cleaned_sentences = [re.sub(r'[^a-zA-Z\s]', '', sentence).lower() for sentence in sentences]

# Tokenizing each cleaned sentence into individual words
tokenized_sentences = [word_tokenize(sentence) for sentence in cleaned_sentences]

# Loading a set of English stopwords from NLTK's corpus
stop_words = set(stopwords.words("english"))

# Filtering out stopwords from each tokenized sentence
tokenized_sentences = [[word for word in words if word not in stop_words] for words in tokenized_sentences]

# Displaying the first two tokenized and filtered sentences
print(tokenized_sentences[:2])


Section 1. Chapter 6