How to calculate tf-idf vectors

We briefly discussed vector space models and TF-IDF in our previous post. In short, TF (Term Frequency) is the number of times a term appears in a given document. DF (Document Frequency) is the number of documents in the corpus (collection) in which the term appears at least once, and IDF (Inverse Document Frequency) is the logarithm of the total number of documents divided by that count. The fewer documents a term appears in, the higher its IDF and the more distinctive that term becomes.

Let’s work through an example to learn how to calculate tf-idf vectors.
Table 1 below shows the titles of 3 documents. Table 2 shows, for each term in those titles, the number of documents in the collection that contain it. The collection holds 10000 documents in total.

Table 1

	Documents
d1	'New York Times'
d2	'New York Post'
d3	'Los Angeles Times'

Table 2

Term	Number of documents
New	500
York	300
Times	600
Los	200
Angeles	400
Post	60

We have

Total number of documents(N) = 10000

We know,

IDF(term) = log10(Total number of documents (N) / Number of documents the term appears in)

and

TF(term) = 1 + log10(number of times the term appears in the document)

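Here the logarithm dampens the effect of raw frequency: a term that appears ten times gets a TF of 2, not 10. This is common practice. The formula applies only to terms that actually appear in the document; a term that does not appear gets a TF of 0. Alternatively, TF can be calculated as: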

TF(term) = (number of times the term appears in the document) / (total number of terms in the document)

We will use the logarithmic version.
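The two TF variants can be compared with a minimal Python sketch. The function names `tf_log` and `tf_relative` and the five-word sample text are assumptions made for illustration; the sample is not one of the titles in Table 1.

```python
import math
from collections import Counter


def tf_log(count):
    """Log-scaled term frequency: 1 + log10(count), or 0 if the term is absent."""
    return 1 + math.log10(count) if count > 0 else 0.0


def tf_relative(count, total_terms):
    """Relative term frequency: occurrences divided by the document length."""
    return count / total_terms


# Hypothetical sample text, chosen so that one term repeats.
tokens = "new york new york times".split()
counts = Counter(tokens)

tf_log_new = tf_log(counts["new"])                    # 1 + log10(2)
tf_rel_new = tf_relative(counts["new"], len(tokens))  # 2 / 5

print(f"log-scaled TF of 'new': {tf_log_new:.3f}")
print(f"relative TF of 'new':   {tf_rel_new:.3f}")
```

With the logarithmic version, a term that occurs exactly once always gets a TF of 1, which is why every TF in the tables of this example is 1.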

Since the IDF of a term is the same across the whole collection, unlike TF, which differs from document to document, let’s calculate the IDF of each term first.

IDF(New) = log10(10000/500) = 1.301
IDF(York) = log10(10000/300) = 1.523
IDF(Times) = log10(10000/600) = 1.222
IDF(Los) = log10(10000/200) = 1.699
IDF(Angeles) = log10(10000/400) = 1.398
IDF(Post) = log10(10000/60) = 2.222

Note: The IDF of “Post” is the highest because it appears in the fewest documents (60). This makes “Post” more informative than the other terms.
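The IDF values above can be reproduced with a short Python sketch. The variable names (`N`, `doc_freq`, `idf`, `rarest`) are chosen here for illustration; the counts come from Table 2.

```python
import math

N = 10_000  # total number of documents in the collection

# Number of documents containing each term (Table 2).
doc_freq = {
    "New": 500,
    "York": 300,
    "Times": 600,
    "Los": 200,
    "Angeles": 400,
    "Post": 60,
}

# IDF(term) = log10(N / number of documents the term appears in)
idf = {term: math.log10(N / df) for term, df in doc_freq.items()}

# Print the terms from highest to lowest IDF.
for term, value in sorted(idf.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:<8}{value:.3f}")

# The term found in the fewest documents has the highest IDF.
rarest = max(idf, key=idf.get)
```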

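Next we multiply TF by IDF for each term in a document. Every term appears once in its title, so TF = 1 + log10(1) = 1 throughout. To make the vectors comparable, each TF-IDF value is then divided by the length of the document’s TF-IDF vector (the square root of the sum of the squared values), which is given in the last row of each table.

For d1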

Term	TF	IDF	TF-IDF	Normalized TF-IDF
New	1	1.301	1.301	0.5545
York	1	1.523	1.523	0.6492
Times	1	1.222	1.222	0.5209
			Euclidean norm = 2.346

For d2

Term	TF	IDF	TF-IDF	Normalized TF-IDF
New	1	1.301	1.301	0.4349
York	1	1.523	1.523	0.5091
Post	1	2.222	2.222	0.7428
			Euclidean norm = 2.9915

For d3

Term	TF	IDF	TF-IDF	Normalized TF-IDF
Los	1	1.699	1.699	0.6750
Angeles	1	1.398	1.398	0.5555
Times	1	1.222	1.222	0.4855
			Euclidean norm = 2.5168
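As a rough check on the three tables, the whole calculation can be sketched in Python. The helper names `tfidf_vector` and `l2_normalize` are assumptions made for this illustration, and splitting titles on whitespace is a simplification. Because the code keeps full precision while the tables use IDF values rounded to three decimals, results may differ from the tables in the fourth decimal place.

```python
import math
from collections import Counter

N = 10_000  # total number of documents in the collection
doc_freq = {"New": 500, "York": 300, "Times": 600,
            "Los": 200, "Angeles": 400, "Post": 60}  # Table 2
docs = {
    "d1": "New York Times",
    "d2": "New York Post",
    "d3": "Los Angeles Times",
}  # Table 1


def tfidf_vector(text):
    """Map each term in the text to (1 + log10(count)) * log10(N / df)."""
    counts = Counter(text.split())
    return {
        term: (1 + math.log10(count)) * math.log10(N / doc_freq[term])
        for term, count in counts.items()
    }


def l2_normalize(vector):
    """Divide every weight by the vector's Euclidean length."""
    length = math.sqrt(sum(w * w for w in vector.values()))
    return {term: w / length for term, w in vector.items()}, length


normalized = {}
lengths = {}
for name, text in docs.items():
    normalized[name], lengths[name] = l2_normalize(tfidf_vector(text))
    print(f"{name}: length = {lengths[name]:.4f}")
    for term, weight in normalized[name].items():
        print(f"  {term:<8}{weight:.4f}")
```

After normalization every document vector has length 1, so long and short documents can be compared on the same scale.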

We will follow up on this in our next article, where we will discuss how to calculate similarities between queries and documents, and between pairs of documents.