How to calculate tf-idf vectors

We briefly discussed vector space models and TF-IDF in our previous post. In short, TF (Term Frequency) is the number of times a term appears in a given document. IDF (Inverse Document Frequency) is derived from the number of documents in the corpus (collection) that contain the term at least once: the fewer documents a term appears in, the higher its IDF and the more useful the term is for telling documents apart.

Let’s work through an example to learn how to calculate tf-idf vectors.
Table 1 below shows the titles of 3 documents. Table 2 shows, for each term in those titles, the number of documents in the whole collection that contain it. The collection holds 10000 documents in total.

Table 1

d1 'New York Times'
d2 'New York Post'
d3 'Los Angeles Times'

Table 2

Term Number of documents
New 500
York 300
Times 600
Los 200
Angeles 400
Post 60

We have

Total number of documents(N) = 10000

We know,

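IDF(term) = log10(N / number of documents the term appears in)


TF(term) = 1 + log10(number of times the term appears in the document)

This formula applies only to terms that appear at least once in the document; for absent terms, TF is 0. The logarithm dampens the effect of raw frequency, so a term that occurs ten times does not count ten times as much as a term that occurs once. This is common practice. Alternatively, TF can be calculated as a relative frequency:

TF(term) = (number of times the term appears in the document) / (total number of terms in the document)

We will use the logarithmic version. Every term in our example appears exactly once in its document, so TF = 1 + log10(1) = 1 throughout.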
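The two TF variants above can be sketched in a few lines of Python. This is only an illustration: the function names `log_tf` and `relative_tf` are made up for this example, and tokenizing by splitting on whitespace is an assumption that happens to work for our short titles.

```python
import math
from collections import Counter

def log_tf(count):
    """Log-scaled TF: 1 + log10(count), or 0 when the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0

def relative_tf(count, total_terms):
    """Relative TF: occurrences divided by the document's total term count."""
    return count / total_terms

# Hypothetical tokenization: split the title of d1 on whitespace.
tokens = "New York Times".split()
counts = Counter(tokens)

for term, count in counts.items():
    print(term, log_tf(count), round(relative_tf(count, len(tokens)), 4))
```

With the log version, a term that occurs 10 times gets a TF of 2 rather than 10, which is the dampening effect described above.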

IDF is the same for a term across the whole collection, whereas TF differs from document to document, so let’s calculate the IDF of each term first.

IDF(New) = log10(10000/500) = 1.301
IDF(York) = log10(10000/300) = 1.523
IDF(Times) = log10(10000/600) = 1.222
IDF(Los) = log10(10000/200) = 1.699
IDF(Angeles) = log10(10000/400) = 1.398
IDF(Post) = log10(10000/60) = 2.222

Note: The IDF of “Post” is the highest because it appears in the fewest documents (60). This makes “Post” the most discriminating of the six terms.
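As a quick check, the IDF values can be reproduced with a short Python sketch. The variable names `N` and `doc_freq` and the helper `idf` are chosen for this example only; the numbers come straight from Table 2.

```python
import math

N = 10_000  # total number of documents in the collection

# Document frequencies from Table 2.
doc_freq = {
    "New": 500,
    "York": 300,
    "Times": 600,
    "Los": 200,
    "Angeles": 400,
    "Post": 60,
}

def idf(term):
    """IDF = log10(N / number of documents containing the term)."""
    return math.log10(N / doc_freq[term])

for term in doc_freq:
    print(f"{term:8s} {idf(term):.3f}")

# The rarest term has the highest IDF.
print("Highest IDF:", max(doc_freq, key=idf))
```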

For d1

Term TF IDF TF-IDF Normalized TF-IDF
New 1 1.301 1.301 0.5545
York 1 1.523 1.523 0.6492
Times 1 1.222 1.222 0.5209
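Euclidean length = sqrt(1.301^2 + 1.523^2 + 1.222^2) = 2.346

Here TF-IDF = TF × IDF, and each normalized value is the TF-IDF value divided by the Euclidean length of the document vector.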

For d2

Term TF IDF TF-IDF Normalized TF-IDF
New 1 1.301 1.301 0.4349
York 1 1.523 1.523 0.5091
Post 1 2.222 2.222 0.7428
Euclidean length = 2.9915

For d3

Term TF IDF TF-IDF Normalized TF-IDF
Los 1 1.699 1.699 0.6751
Angeles 1 1.398 1.398 0.5555
Times 1 1.222 1.222 0.4855
Euclidean length = 2.5168
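The three tables can be reproduced end to end with the sketch below. It assumes whitespace tokenization and the log-scaled TF, and it normalizes each vector by its Euclidean length (the square root of the sum of squared TF-IDF values). The function name `tfidf_vector` is made up for this example. Because the code works with unrounded IDF values, the last printed digit may differ slightly from the hand-calculated tables.

```python
import math
from collections import Counter

N = 10_000  # total number of documents in the collection

# Document frequencies from Table 2 and titles from Table 1.
doc_freq = {"New": 500, "York": 300, "Times": 600,
            "Los": 200, "Angeles": 400, "Post": 60}
docs = {
    "d1": "New York Times",
    "d2": "New York Post",
    "d3": "Los Angeles Times",
}

def tfidf_vector(text):
    """Return (normalized TF-IDF weights, Euclidean length) for one document."""
    counts = Counter(text.split())
    # TF-IDF = (1 + log10(count)) * log10(N / document frequency)
    weights = {
        term: (1 + math.log10(count)) * math.log10(N / doc_freq[term])
        for term, count in counts.items()
    }
    length = math.sqrt(sum(w * w for w in weights.values()))
    normalized = {term: w / length for term, w in weights.items()}
    return normalized, length

for name, text in docs.items():
    vector, length = tfidf_vector(text)
    print(name, round(length, 4), {t: round(w, 4) for t, w in vector.items()})
```

After normalization every document vector has length 1, which makes vectors of documents with different sizes directly comparable.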

We will follow up on this in our next article, where we will discuss how to calculate similarities between queries and documents, and between pairs of documents.