Tokenization Using Regular Expressions

Why Regular Expressions?

While the word_tokenize() and sent_tokenize() functions from the NLTK library offer convenient ways to tokenize text into words and sentences, they might not always suit specific text processing needs. Let's therefore explore an alternative approach: tokenization using regular expressions (regex).

To recap, a regular expression is a sequence of characters that defines a search pattern. Regular expressions can be used for various text processing tasks, including searching, replacing, and splitting text based on specific patterns.

In the context of tokenization, regex allows for defining custom patterns that can identify tokens, offering more control over the tokenization process than pre-built functions.

If you don't know the basics of regular expressions, feel free to take our Python Basics: Regex Wizards project.

Using regexp_tokenize()

Luckily, the NLTK library includes the regexp_tokenize() function in the tokenize module, which tokenizes a string into substrings using a regular expression. This function is particularly useful when you need to tokenize text based on patterns that are not well-handled by the standard tokenizers.

The most important parameters of regexp_tokenize() are its first two: text (the string to be tokenized) and pattern (the regular expression that defines the tokens).

Let's take a look at an example:
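A minimal sketch of such a call; the sample sentence and the printed output are our own illustration.

from nltk.tokenize import regexp_tokenize

text = "Tokenization is the first step in NLP!"
# '\w+' matches one or more word characters, so punctuation is left out
tokens = regexp_tokenize(text, r'\w+')
print(tokens)
# Expected output: ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP']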

As you can see, the process is similar to using the word_tokenize() function; however, the results may vary depending on the pattern. In our example, the pattern '\w+' matches one or more word characters (letters, digits, and underscores).

This results in a list of words without punctuation marks, which differs from word_tokenize(), where punctuation typically appears as separate tokens. Thus, the output of our regexp_tokenize() example is simply the words of the sentence.

Using RegexpTokenizer

An alternative approach for custom tokenization involves using the RegexpTokenizer class from NLTK's tokenize module. To begin, create an instance of RegexpTokenizer, providing your desired regular expression pattern as an argument; this pattern defines how the text will be tokenized.

Unlike the regexp_tokenize() function, you do not supply the text at the time the RegexpTokenizer instance is created. Instead, once the instance exists with the specified pattern, you call its tokenize() method, passing the text you wish to tokenize as an argument.

Here is an example:
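A minimal sketch using the same '\w+' pattern; again, the sample sentence and output are our own.

from nltk.tokenize import RegexpTokenizer

# Create the tokenizer once with the desired pattern
tokenizer = RegexpTokenizer(r'\w+')

text = "Tokenization is the first step in NLP!"
print(tokenizer.tokenize(text))
# Expected output: ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'NLP']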

This approach yields the same results, and it can be better when you need one tokenizer for several texts: you create the tokenizer once and then apply it to various inputs without redefining the pattern each time.

Let's proceed with another example. Suppose we want only digits to be our tokens; then the pattern '\d+' matches one or more digits, as in the example below:
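A minimal sketch of this case, with an example sentence of our own that contains a couple of numbers.

from nltk.tokenize import RegexpTokenizer

# '\d+' matches one or more digits, so only the numbers become tokens
tokenizer = RegexpTokenizer(r'\d+')

text = "The film was released in 1999 and earned 250 million dollars."
print(tokenizer.tokenize(text))
# Expected output: ['1999', '250']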

Overall, regex tokenization allows for highly customized tokenization, making it ideal for handling complex patterns and specific tokenization rules not easily managed by standard methods like word_tokenize(). In our last example, where we wanted only numbers as tokens, word_tokenize() would not have been suitable.
