Implementing TF-IDF
Default Implementation
The implementation of the TF-IDF model in sklearn is similar to that of the Bag of Words model. To train this model on a corpus, we use the TfidfVectorizer class with the already familiar .fit_transform() method.
Let's take a look at an example:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Create a default TF-IDF model
vectorizer = TfidfVectorizer()
# Generate a TF-IDF matrix
tfidf_matrix = vectorizer.fit_transform(corpus)
# Convert a sparse matrix into a DataFrame
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
As you can see, aside from using a different class, the rest of the implementation is identical to that of the Bag of Words model. By default, the TF-IDF matrix is computed, as described in the previous chapter, with L2 normalization.
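To see what L2 normalization means in practice, here is a minimal sketch (not part of the original example) that checks the Euclidean length of each document vector. It reuses the same corpus and relies on norm='l2' being the default value of the norm parameter; the row_norms variable is introduced only for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# norm='l2' is the default, so each document vector is scaled to unit length
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(corpus)

# Euclidean (L2) norm of every row, i.e. of every document vector
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)
print(row_norms)
```

Each printed norm should be approximately 1, since every document here contains at least one term from the vocabulary.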
Customizing TF-IDF
Once again, similar to CountVectorizer, we can specify the min_df and max_df parameters to include only terms that occur in at least min_df documents and at most max_df documents. These can be specified as either absolute numbers of documents (integers) or as a proportion of the total number of documents (floats).
Here is an example where we include only those terms that appear in exactly 2 documents by setting both min_df and max_df to 2:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Include terms which appear in exactly 2 documents
vectorizer = TfidfVectorizer(min_df=2, max_df=2)
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
To specify the n-grams to include in our matrix, we can use the ngram_range parameter. Let's include only bigrams in the resulting matrix:
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'Global climate change poses significant risks to global ecosystems.',
    'Global warming and climate change demand urgent action.',
    'Sustainable environmental practices support environmental conservation.',
]

# Include only bigrams
vectorizer = TfidfVectorizer(ngram_range=(2, 2))
tfidf_matrix = vectorizer.fit_transform(corpus)
tfidf_matrix_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
print(tfidf_matrix_df)
These are the most commonly used parameters. If you want to explore more of them, you can refer to the documentation.
Your task is to display the vector for the 'medical' unigram in a TF-IDF model with unigrams, bigrams, and trigrams:
- Import the TfidfVectorizer class to create a TF-IDF model.
- Instantiate the TfidfVectorizer class as tfidf_vectorizer so that it includes unigrams, bigrams, and trigrams.
- Use the appropriate method of tfidf_vectorizer to generate a TF-IDF matrix from the 'Document' column of corpus.
- Convert tfidf_matrix to a dense array and create a DataFrame from it, setting the unique features (terms) as its columns. Assign this to the variable tfidf_matrix_df.
- Display the vector for 'medical' as an array, rather than as a pandas Series.