Tokenization | Text Preprocessing Fundamentals

Tokenization

Before actually diving into the process of tokenization, we have to first define what tokens are.

Tokens are independent and minimal text components that have a specific syntax and semantics.

Consequently, tokenization is the process of splitting the text into tokens. For example, a paragraph, a text document, or a text corpus consists of several components that can be divided into sentences, phrases, and words. In fact, the most popular tokenization methods include sentence and word tokenization, which are used to break a text document (or corpus) into sentences and each sentence into words.

A text corpus (plural: corpora) is a large and structured set of texts used in linguistic and computational linguistics research. Essentially, it's a comprehensive collection of written or spoken material that serves as a representative sample of a particular language, dialect, or subject area.

Sentence Tokenization

Let's start off with sentence tokenization. Luckily for us, nltk provides the sent_tokenize() function in the tokenize module. The primary purpose of this function is to split a given text into a list of sentences.

sent_tokenize() utilizes a pre-trained model, typically a machine learning model that has been trained on a large corpus of text, to identify the boundaries between sentences. It takes into consideration various cues in the text, such as punctuation marks (e.g., periods, exclamation points, question marks), capitalization, and other linguistic patterns that typically mark the end of one sentence and the beginning of another.

Let's take a look at an example to make things clear:
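
Here is a minimal sketch of how this might look (the sample text and variable names below are our own, chosen purely for illustration):

```python
import nltk
from nltk.tokenize import sent_tokenize

# Download the Punkt tokenizer models (required once for sent_tokenize).
nltk.download('punkt')

text = "Tokenization is the first step in NLP. It splits text into tokens! Shall we try it out?"

# Split the text into a list of sentences.
sentences = sent_tokenize(text)
print(sentences)
# Expected output:
# ['Tokenization is the first step in NLP.', 'It splits text into tokens!', 'Shall we try it out?']
```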

As you can see, there is nothing complicated here. You should simply pass a string with your text as an argument of sent_tokenize() to obtain a list of sentences. Speaking of nltk.download('punkt'), this command specifically downloads the "Punkt" tokenizer models. By downloading the Punkt tokenizer models, you ensure that NLTK has the necessary data to perform accurate sentence and word tokenization.

The punctuation marks at the end of each sentence are included in the sentence.

Word Tokenization

There are several common methods for performing word tokenization; however, in this chapter, we'll discuss only the two most prevalent ones.

The simplest and most straightforward method is to use the split() method of the str class, which uses whitespace (spaces, tabs, and newlines) as the delimiter by default. However, you can also pass an arbitrary string as its argument to serve as the delimiter.

Here is an example:
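
A small illustrative snippet (the string is our own):

```python
text = "This is a simple example. It shows how split() works."

# With no argument, split() treats any run of whitespace (spaces, tabs, newlines) as a delimiter.
tokens = text.split()
print(tokens)
# Expected output:
# ['This', 'is', 'a', 'simple', 'example.', 'It', 'shows', 'how', 'split()', 'works.']
```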

To ensure that tokens like 'This' and 'this' are treated as the same, it is important to convert the string to lowercase before tokenization.
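 
For instance (again with a made-up string), lowercasing first merges such variants:

```python
text = "This sentence mentions this twice."

# Lowercase before splitting so 'This' and 'this' become the same token.
tokens = text.lower().split()
print(tokens)
# Expected output:
# ['this', 'sentence', 'mentions', 'this', 'twice.']
```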

A more flexible approach, however, is to use the word_tokenize() function in the tokenize module of the nltk library. This function identifies and separates words based on spaces and punctuation marks, effectively breaking down sentences into their constituent words. Similarly to sent_tokenize(), this function requires a string as its argument.

Let's compare this approach with using the split() method. The example below uses word_tokenize():
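
A sketch with a sample string of our own, chosen so both approaches can be compared on the same input:

```python
from nltk.tokenize import word_tokenize

# Assumes the Punkt models have already been downloaded via nltk.download('punkt').
text = "I bought a new laptop for $1000. It works great!"

# word_tokenize() separates punctuation and special characters into their own tokens.
tokens = word_tokenize(text)
print(tokens)
# Expected output:
# ['I', 'bought', 'a', 'new', 'laptop', 'for', '$', '1000', '.', 'It', 'works', 'great', '!']
```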

Let's now see how the split() method performs with the same text:
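
Using the same made-up string:

```python
text = "I bought a new laptop for $1000. It works great!"

# split() keeps punctuation attached to the neighboring words.
tokens = text.split()
print(tokens)
# Expected output:
# ['I', 'bought', 'a', 'new', 'laptop', 'for', '$1000.', 'It', 'works', 'great!']
```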

In our example, word_tokenize(), contrary to split(), accurately identifies punctuation and special characters as separate tokens. It correctly separates the dollar sign from the numeral and recognizes periods as standalone tokens. This nuanced tokenization is crucial for many NLP tasks, where the precise delineation of words and punctuation can significantly impact the accuracy and insights of the analysis.

Given the sentence "It wasn't me, I swear!", what will be the result of applying the split() method on it?

Select the correct answer
