## How to calculate tf-idf vectors


We briefly discussed vector space models and TF-IDF in our previous post. In short, TF (Term Frequency) is the number of times a term appears in a given document. IDF (Inverse Document Frequency) is derived from the number of documents in the corpus (collection) that contain the term at least once: it is the logarithm of the total number of documents divided by that count. The fewer documents a term appears in, the higher its IDF and the more relevant that term becomes.

Let’s work through an example to learn how to calculate tf-idf vectors.
Table 1 shows the titles of 3 documents. Table 2 shows, for each term in those titles, the number of documents in the corpus that contain it. The corpus contains 10000 documents in total.

Table 1

| Document | Title |
|---|---|
| d1 | 'New York Times' |
| d2 | 'New York Post' |
| d3 | 'Los Angeles Times' |

Table 2

| Term | Number of documents |
|---|---|
| New | 500 |
| York | 300 |
| Times | 600 |
| Los | 200 |
| Angeles | 400 |
| Post | 60 |

We have

``Total number of documents(N) = 10000``

We know,

``IDF_term = log10(Total number of documents (N) / Number of documents the term appears in)``

and

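``TF_term = 1 + log10(number of times the term appears in the document)``

Here the logarithm dampens the effect of raw frequency, so a term that appears ten times does not count ten times as much. This is common practice. Alternatively, TF can be calculated as:

``TF_term = (number of times the term appears in the document) / (total number of terms in the document)``

We will use the logarithmic version. Every term in our three titles appears exactly once, so its TF is 1 + log10(1) = 1.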
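The two TF variants above can be sketched in Python as follows. The function names `tf_log` and `tf_relative` are made up for this illustration, and returning 0 for a term that does not occur is an assumed convention, since the formula itself is only defined for counts of at least 1.

```python
import math

def tf_log(count: int) -> float:
    """Log-scaled TF: 1 + log10(count). Returns 0.0 for an absent term (assumed convention)."""
    return 1 + math.log10(count) if count > 0 else 0.0

def tf_relative(count: int, total_terms: int) -> float:
    """Relative TF: occurrences of the term divided by the document length."""
    return count / total_terms

# 'New York Times' has three terms, each appearing once.
print(tf_log(1))          # 1.0
print(tf_relative(1, 3))  # one third
```

With the log-scaled version, a term appearing 10 times gets a TF of 2 rather than 10, which is the dampening effect mentioned above.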

Since the IDF of a term is the same across the whole corpus, whereas TF differs from document to document, let’s calculate the IDF of each term first.

```
IDF_New = log10(10000/500) = 1.301
IDF_York = log10(10000/300) = 1.523
IDF_Times = log10(10000/600) = 1.222
IDF_Los = log10(10000/200) = 1.699
IDF_Angeles = log10(10000/400) = 1.398
IDF_Post = log10(10000/60) = 2.222
```

Note: The IDF of “Post” is the highest because it appears in the fewest documents (60). This makes “Post” more valuable than the other terms.
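The same IDF values can be reproduced with a few lines of Python. This is only a sketch: the variable names `N`, `doc_freq` and `idf` are chosen here for illustration, and the counts are copied from Table 2.

```python
import math

N = 10_000  # total number of documents in the corpus

# Number of documents containing each term (Table 2)
doc_freq = {"New": 500, "York": 300, "Times": 600,
            "Los": 200, "Angeles": 400, "Post": 60}

# IDF = log10(N / number of documents the term appears in)
idf = {term: math.log10(N / df) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term}: {value:.3f}")
```

Sorting `idf` by value would put "Post" first, matching the note above.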

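For d1

| Term | TF | IDF | TF-IDF | Normalized TF-IDF |
|---|---|---|---|---|
| New | 1 | 1.301 | 1.301 | 0.5545 |
| York | 1 | 1.523 | 1.523 | 0.6492 |
| Times | 1 | 1.222 | 1.222 | 0.5209 |

RMS = 2.346

Here RMS is the square root of the sum of the squared TF-IDF values, i.e. the Euclidean length of the vector. Each normalized value is the TF-IDF divided by it, so the normalized vector has length 1.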
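As a hand check of the d1 figures, the arithmetic can be written out as below. The symbols $w_t$ (the TF-IDF weight of term $t$) and $\lVert d_1 \rVert$ (the length of the d1 vector) are notation introduced here, not taken from the post.

```latex
\lVert d_1 \rVert = \sqrt{1.301^2 + 1.523^2 + 1.222^2} = \sqrt{5.505} \approx 2.346
\qquad
\frac{w_{\text{Times}}}{\lVert d_1 \rVert} = \frac{1.222}{2.346} \approx 0.5209
```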

For d2

| Term | TF | IDF | TF-IDF | Normalized TF-IDF |
|---|---|---|---|---|
| New | 1 | 1.301 | 1.301 | 0.4349 |
| York | 1 | 1.523 | 1.523 | 0.5091 |
| Post | 1 | 2.222 | 2.222 | 0.7428 |

RMS = 2.9915

For d3

| Term | TF | IDF | TF-IDF | Normalized TF-IDF |
|---|---|---|---|---|
| Los | 1 | 1.699 | 1.699 | 0.6750 |
| Angeles | 1 | 1.398 | 1.398 | 0.5555 |
| Times | 1 | 1.222 | 1.222 | 0.4855 |

RMS = 2.5168
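Putting the steps together, the three tables can be recomputed with a short script. This is a minimal sketch under a few assumptions: terms are split on whitespace, the helper name `tfidf_vector` is hypothetical, and the normalizer is the square root of the sum of squared TF-IDF values (the figure labeled RMS above). Because the script keeps full precision instead of the rounded IDFs, its output may differ from the tables in the last decimal place.

```python
import math

N = 10_000  # total number of documents in the corpus
doc_freq = {"New": 500, "York": 300, "Times": 600,
            "Los": 200, "Angeles": 400, "Post": 60}
docs = {"d1": "New York Times",
        "d2": "New York Post",
        "d3": "Los Angeles Times"}

def tfidf_vector(text):
    """Return (normalized TF-IDF weights, vector length) for one document."""
    # Count how often each term occurs in the document.
    counts = {}
    for term in text.split():
        counts[term] = counts.get(term, 0) + 1
    # Log-scaled TF times IDF for every term present.
    weights = {t: (1 + math.log10(c)) * math.log10(N / doc_freq[t])
               for t, c in counts.items()}
    # Square root of the sum of squares, then scale to unit length.
    length = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / length for t, w in weights.items()}, length

for name, text in docs.items():
    normalized, length = tfidf_vector(text)
    print(name, round(length, 4),
          {t: round(w, 4) for t, w in normalized.items()})
```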

We will follow up on this in our next article, where we will discuss how to calculate similarity between queries and documents, and between pairs of documents.