TF-IDF | Basic Text Models

Understanding TF-IDF

Due to its simplicity, the bag of words model has drawbacks: terms that occur frequently across all documents can overshadow less frequent but more informative ones, which may be more effective as features for identifying specific categories or distinguishing documents. To address this, the TF-IDF model is often used instead.

TF-IDF (Term Frequency-Inverse Document Frequency) is a statistical measure used to evaluate the importance of a word to a document in the context of a corpus.

Unlike BoW's focus on absolute term frequencies, TF-IDF considers both a term's frequency within a document and its inverse frequency across the entire corpus. This helps reduce the weight of overly common terms, amplifying the significance of rarer but potentially more informative ones.

How TF-IDF Works

Essentially, the TF-IDF score for a term in a document is computed as follows:

tf-idf(t, d) = tf(t, d) × idf(t)

where t is a particular term (word or n-gram) and d is a particular document.

TF-IDF formulas

Let's now break down the formulas for tf and idf:

  • Term Frequency (TF): Calculated as the count of a term in a document, count(t, d). It measures a term's importance within a specific document.
  • Inverse Document Frequency (IDF): Calculated as idf(t) = ln((1 + N_documents) / (1 + df(t))) + 1, where N_documents is the total number of documents in the corpus and df(t) is the number of documents containing the term. Adding 1 to the numerator and denominator prevents division by zero for absent terms, and adding 1 to the logarithm ensures non-zero IDF values for terms present in all documents, so they still contribute to the TF-IDF score. Overall, IDF reduces the weight of terms common across the corpus.

As you can see, if we only used TF without IDF, we would simply get a frequency-based bag of words.
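
To make these formulas concrete, here is a minimal Python sketch of the two functions, using the smoothed IDF variant described above. The two-document corpus is made up purely for illustration.

    import math

    def tf(term, document):
        # Term frequency: raw count of the term in the document
        return document.count(term)

    def idf(term, corpus):
        # Smoothed IDF: ln((1 + N_documents) / (1 + df(t))) + 1
        n_documents = len(corpus)
        df = sum(1 for doc in corpus if term in doc)
        return math.log((1 + n_documents) / (1 + df)) + 1

    def tf_idf(term, document, corpus):
        return tf(term, document) * idf(term, corpus)

    # Hypothetical corpus of two tokenized documents
    corpus = [
        ["this", "is", "a", "cat"],
        ["this", "is", "a", "dog", "a", "big", "dog"],
    ]

    print(tf_idf("a", corpus[1], corpus))    # term appears in both documents
    print(tf_idf("dog", corpus[1], corpus))  # term appears in only one document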

Calculating TF-IDF

Let's now take a look at an example:

TF-IDF calculation

Here, we have only two documents and use exclusively unigrams (words), so the calculations should be straightforward. First, we calculate the term frequencies for each term in each document. Then, we compute the IDF values for the terms 'a' and 'is'.

Since there are only two documents in our corpus, these are the only possible IDF values for the terms within this corpus. Every term that appears in both documents will have an IDF value of 1, while other terms will have an IDF value of approximately 1.405465.
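
You can verify these two values directly from the smoothed IDF formula:

    import math

    # Term present in both documents: N_documents = 2, df(t) = 2
    print(math.log((1 + 2) / (1 + 2)) + 1)  # 1.0

    # Term present in only one document: N_documents = 2, df(t) = 1
    print(math.log((1 + 2) / (1 + 1)) + 1)  # approximately 1.405465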

Finally, we can compute the TF-IDF values for each term in each document by multiplying TF by IDF, resulting in the following matrix:

TF-IDF Matrix
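
In practice, such a matrix is typically computed with scikit-learn's TfidfVectorizer. The sketch below uses a made-up two-document corpus rather than the exact documents from the figure; norm=None is passed so that the raw TF-IDF values are returned without the normalization discussed in the next section.

    from sklearn.feature_extraction.text import TfidfVectorizer

    # Hypothetical corpus standing in for the two example documents
    corpus = [
        "this is a cat",
        "this is a dog a big dog",
    ]

    # norm=None keeps raw TF-IDF values; smooth_idf=True (the default) applies
    # the "+1" adjustments described above; the token_pattern keeps
    # single-character words such as 'a'
    vectorizer = TfidfVectorizer(norm=None, token_pattern=r"(?u)\b\w+\b")
    tfidf_matrix = vectorizer.fit_transform(corpus)

    print(vectorizer.get_feature_names_out())
    print(tfidf_matrix.toarray())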

L2 Normalization

The resulting TF-IDF document vectors, especially in large text corpora, can vary greatly in magnitude due to differences in document length. L2 normalization is therefore applied to scale these vectors to a uniform length, allowing comparisons of textual similarity that are not biased by document size.

L2 normalization, also known as Euclidean normalization, is a process applied to individual vectors that adjusts their values to ensure that the length of the vector is 1.

L2 normalization is performed by dividing each component of the vector by the Euclidean norm of the vector. The Euclidean norm (or L2 norm) of a vector is the square root of the sum of the squares of its components.

Here is how L2 normalization works for a 2-dimensional vector (a document with 2 terms):

L2 normalization
Don't worry if this seems a bit complicated; we simply divide each TF-IDF value in a document by the magnitude of that document's TF-IDF vector.
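
For instance, a 2-dimensional TF-IDF vector could be normalized with NumPy like this (the values are made up):

    import numpy as np

    vector = np.array([2.0, 1.405465])   # hypothetical TF-IDF values of one document
    norm = np.sqrt(np.sum(vector ** 2))  # Euclidean (L2) norm of the vector
    normalized = vector / norm

    print(normalized)                    # each value divided by the norm
    print(np.linalg.norm(normalized))    # the normalized vector has length 1.0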

Let's now apply L2 normalization to our TF-IDF matrix, which we calculated above:

TF-IDF matrix

The resulting matrix is exactly what we had as an example in one of the previous chapters.
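
With scikit-learn, L2 normalization is applied automatically by leaving norm at its default value of 'l2'. The sketch below again uses the hypothetical corpus from the earlier example, not the exact documents from the figure:

    from sklearn.feature_extraction.text import TfidfVectorizer

    corpus = [
        "this is a cat",
        "this is a dog a big dog",
    ]

    # norm='l2' is the default, so every row (document vector) has unit length
    vectorizer = TfidfVectorizer(norm="l2", token_pattern=r"(?u)\b\w+\b")
    tfidf_matrix = vectorizer.fit_transform(corpus)

    print(tfidf_matrix.toarray())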

What is the key advantage of the TF-IDF model in comparison to the BoW model?
