TF-IDF | Basic Text Models

Understanding TF-IDF

Due to its simplicity, the bag of words model has drawbacks: terms that occur frequently across all documents can overshadow less frequent but more informative ones, which may be more effective as features for identifying specific categories or distinguishing documents. To address this, the TF-IDF model is often used instead.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word to a document in the context of a corpus.

Unlike BoW's focus on absolute term frequencies, TF-IDF considers both a term's frequency within a document and its inverse frequency across the entire corpus. This helps reduce the weight of overly common terms, amplifying the significance of rarer but potentially more informative ones.

How TF-IDF Works

Essentially, the TF-IDF score for a term in a document is computed as follows:

tf-idf(t, d) = tf(t, d) × idf(t)

where t is a particular term (word or n-gram) and d is a particular document.

TF-IDF formulas

Let's now break down the formulas for tf and idf:

  • Term Frequency (TF): Calculated as the count of a term in a document, count(t, d). It measures a term's importance within a specific document.
  • Inverse Document Frequency (IDF): Calculated as idf(t) = ln((1 + N_documents) / (1 + df(t))) + 1, where N_documents is the total number of documents in the corpus and df(t) is the number of documents containing the term. Adding 1 to the numerator and denominator prevents division by zero for absent terms, and adding 1 to the logarithm ensures non-zero IDF values for terms present in all documents, so they still contribute to the TF-IDF score. Overall, IDF reduces the weight of terms common across the corpus.

As you can see, if we only used TF without IDF, we would simply get a frequency-based bag of words.
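
To make these formulas concrete, here is a minimal Python sketch of the two functions, using the smoothed IDF variant described above. The two-document corpus is made up purely for illustration.

    import math

    def tf(term, document):
        # Term frequency: raw count of the term in the document
        return document.count(term)

    def idf(term, corpus):
        # Smoothed IDF: ln((1 + N_documents) / (1 + df(t))) + 1
        n_documents = len(corpus)
        df = sum(1 for doc in corpus if term in doc)
        return math.log((1 + n_documents) / (1 + df)) + 1

    def tf_idf(term, document, corpus):
        return tf(term, document) * idf(term, corpus)

    # Hypothetical corpus of two tokenized documents
    corpus = [
        ["this", "is", "a", "cat"],
        ["this", "is", "a", "dog", "a", "big", "dog"],
    ]

    print(tf_idf("a", corpus[1], corpus))    # term appears in both documents
    print(tf_idf("dog", corpus[1], corpus))  # term appears in only one document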

Calculating TF-IDF

Let's now take a look at an example:

TF-IDF calculation

Here, we have only two documents and use exclusively unigrams (words), so the calculations should be straightforward. First, we calculate the term frequencies for each term in each document. Then, we compute the IDF values for the terms 'a' and 'is'.

Since there are only two documents in our corpus, these are the only possible IDF values for the terms within this corpus. Every term that appears in both documents will have an IDF value of 1, while other terms will have an IDF value of approximately 1.405465.
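
You can verify these two values directly from the smoothed IDF formula:

    import math

    # Term present in both documents: N_documents = 2, df(t) = 2
    print(math.log((1 + 2) / (1 + 2)) + 1)  # 1.0

    # Term present in only one document: N_documents = 2, df(t) = 1
    print(math.log((1 + 2) / (1 + 1)) + 1)  # approximately 1.405465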

Finally, we can compute the TF-IDF values for each term in each document by multiplying TF by IDF, resulting in the following matrix:

TF-IDF Matrix
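
In practice, such a matrix is typically computed with scikit-learn's TfidfVectorizer. The sketch below uses a made-up two-document corpus rather than the exact documents from the figure; norm=None is passed so that the raw TF-IDF values are returned without the normalization discussed in the next section.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical corpus standing in for the two example documents
    corpus = [
        "this is a cat",
        "this is a dog a big dog",
    ]

    # norm=None keeps raw TF-IDF values; smooth_idf=True (the default) applies
    # the "+1" adjustments described above; the token_pattern keeps
    # single-character words such as 'a'
    vectorizer = TfidfVectorizer(norm=None, token_pattern=r"(?u)\b\w+\b")
    tfidf_matrix = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    print(tfidf_matrix.toarray())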

L2 Normalization

The resulting TF-IDF document vectors, especially in large text corpora, can vary greatly in magnitude due to differences in document length. L2 normalization is therefore applied to scale these vectors to a uniform length, allowing comparisons of textual similarity that are not biased by document size.

L2 normalization, also known as Euclidean normalization, is a process applied to individual vectors that adjusts their values to ensure that the length of the vector is 1.

L2 normalization is performed by dividing each component of the vector by the Euclidean norm of the vector. The Euclidean norm (or L2 norm) of a vector is the square root of the sum of the squares of its components.

Here is how L2 normalization works for a 2-dimensional vector (a document with 2 terms):

L2 normalization
Don't worry if this seems a bit complicated; we simply divide each TF-IDF value in a document by the magnitude of that document's TF-IDF vector.
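
For instance, a 2-dimensional TF-IDF vector could be normalized with NumPy like this (the values are made up):

    import numpy as np

    vector = np.array([2.0, 1.405465])   # hypothetical TF-IDF values of one document
    norm = np.sqrt(np.sum(vector ** 2))  # Euclidean (L2) norm of the vector
    normalized = vector / norm

    print(normalized)                    # each value divided by the norm
    print(np.linalg.norm(normalized))    # the normalized vector has length 1.0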

Let's now apply L2 normalization to our TF-IDF matrix, which we calculated above:

TF-IDF matrix

The resulting matrix is exactly what we had as an example in one of the previous chapters.
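
With scikit-learn, L2 normalization is applied automatically by leaving norm at its default value of 'l2'. The sketch below again uses the hypothetical corpus from the earlier example, not the exact documents from the figure:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "this is a cat",
        "this is a dog a big dog",
    ]

    # norm='l2' is the default, so every row (document vector) has unit length
    vectorizer = TfidfVectorizer(norm="l2", token_pattern=r"(?u)\b\w+\b")
    tfidf_matrix = vectorizer.fit_transform(corpus)

    print(tfidf_matrix.toarray())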

What is the key advantage of the TF-IDF model in comparison to the BoW model?
