Tokenization Using Regular Expressions

Why Regular Expressions?

While the word_tokenize() and sent_tokenize() functions from the NLTK library offer convenient ways to tokenize text into words and sentences, they might not always suit specific text processing needs. Let's therefore explore an alternative approach: tokenization using regular expressions (regex).

To recap, a regular expression is a sequence of characters that defines a search pattern. Regular expressions can be used for various text processing tasks, including searching, replacing, and splitting text based on specific patterns.

In the context of tokenization, regex allows for defining custom patterns that can identify tokens, offering more control over the tokenization process than pre-built functions.

If you don't know the basics of regular expressions, feel free to take our Python Basics: Regex Wizards project.

Using regexp_tokenize()

Luckily, the NLTK library includes the regexp_tokenize() function in the tokenize module, which tokenizes a string into substrings using a regular expression. This function is particularly useful when you need to tokenize text based on patterns that are not well-handled by the standard tokenizers.

The most important parameters of regexp_tokenize() are its first two: text (the string to be tokenized) and pattern (the regular expression that defines the tokens).

Let's take a look at an example:
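A minimal sketch of such a call; the sample sentence and the printed output are our own illustration.

from nltk.tokenize import regexp_tokenize

text = "Tokenization is the first step in NLP!"
# '\w+' matches one or more word characters, so punctuation is left out
tokens = regexp_tokenize(text, r'\w+')
print(tokens)
# Expected output: ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP']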

As you can see, the process is similar to using the word_tokenize() function; however, the results may vary depending on the pattern. In our example, the pattern '\w+' matches one or more word characters (letters, digits, and underscores).

This results in a list of words without punctuation marks, which differs from word_tokenize(), where punctuation typically appears as separate tokens. Thus, the output of our regexp_tokenize() example is simply the words of the sentence.

Using RegexpTokenizer

An alternative approach for custom tokenization involves using the RegexpTokenizer class from NLTK's tokenize module. To begin, create an instance of RegexpTokenizer, providing your desired regular expression pattern as an argument; this pattern defines how the text will be tokenized.

Unlike the regexp_tokenize() function, you do not supply the text at the time the RegexpTokenizer instance is created. Instead, once the instance exists with the specified pattern, you call its tokenize() method, passing the text you wish to tokenize as an argument.

Here is an example:
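A minimal sketch using the same '\w+' pattern; again, the sample sentence and output are our own.

from nltk.tokenize import RegexpTokenizer

# Create the tokenizer once with the desired pattern
tokenizer = RegexpTokenizer(r'\w+')

text = "Tokenization is the first step in NLP!"
print(tokenizer.tokenize(text))
# Expected output: ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP']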

This approach yields the same results, and it can be better when you need one tokenizer for several texts: you create the tokenizer once and then apply it to various inputs without redefining the pattern each time.

Let's proceed with another example. Suppose we want only digits to be our tokens; then the pattern '\d+' matches one or more digits, as in the example below:
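A minimal sketch of this case, with an example sentence of our own that contains a couple of numbers.

from nltk.tokenize import RegexpTokenizer

# '\d+' matches one or more digits, so only the numbers become tokens
tokenizer = RegexpTokenizer(r'\d+')

text = "The film was released in 1999 and earned 250 million dollars."
print(tokenizer.tokenize(text))
# Expected output: ['1999', '250']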

Overall, regex tokenization allows for highly customized tokenization, making it ideal for handling complex patterns and specific tokenization rules not easily managed by standard methods like word_tokenize(). In our last example, where we wanted only numbers as tokens, word_tokenize() would not have been suitable.
