We briefly discussed vector space models and TF-IDF in our previous post. In short, TF (Term Frequency) is the number of times a term appears in a given document. IDF (Inverse Document Frequency) is derived from the number of documents in the corpus (collection) in which the term appears at least once: the fewer documents a term appears in, the higher its IDF and the more relevant that term becomes.
Let’s work through an example to learn how to calculate tf-idf vectors.
Table 1 below shows the titles of 3 documents. Table 2 shows, for each term in those titles, the number of documents in the collection that contain it. The collection holds 10000 documents in total.
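Table 1: Document titles
|Document||Title|
|d1||'New York Times'|
|d2||'New York Post'|
|d3||'Los Angeles Times'|
Table 2: Number of documents containing each term
|Term||Number of documents|
|New||500|
|York||300|
|Times||600|
|Los||200|
|Angeles||400|
|Post||60|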
Total number of documents (N) = 10000
IDF(term) = log10(N / number of documents the term appears in)
TF(term) = 1 + log10(number of times the term appears in the document)
Here the logarithm dampens the effect of raw frequency, so a term that appears ten times does not count ten times as much as a term that appears once. This is common practice. Alternatively, TF can be calculated as:
TF(term) = (number of times the term appears in the document) / (total number of terms in the document)
We will use the logarithmic version.
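The two TF variants above can be sketched in Python as follows. The function names `tf_log` and `tf_relative` are made up for this illustration, and returning 0 for a term that does not occur at all is an assumption (a common convention, since log10(0) is undefined), not something stated in this post.

```python
import math

def tf_log(count: int) -> float:
    """Log-scaled TF: 1 + log10(count). Assumption: 0 when the term is absent."""
    return 1.0 + math.log10(count) if count > 0 else 0.0

def tf_relative(count: int, doc_length: int) -> float:
    """Relative TF: occurrences divided by the total number of terms in the document."""
    return count / doc_length if doc_length > 0 else 0.0

# Hypothetical counts for one term in a 100-term document.
for count in (1, 10, 100):
    print(count, tf_log(count), tf_relative(count, 100))
```

With the log-scaled variant, going from 1 to 100 occurrences only raises the weight from 1 to 3, whereas the relative variant grows linearly with the count.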
A term's IDF is the same across the whole collection, unlike its TF, which differs from document to document. So let’s calculate the IDF of each term first.
IDF(New) = log10(10000/500) = 1.301
IDF(York) = log10(10000/300) = 1.523
IDF(Times) = log10(10000/600) = 1.222
IDF(Los) = log10(10000/200) = 1.699
IDF(Angeles) = log10(10000/400) = 1.398
IDF(Post) = log10(10000/60) = 2.222
Note: The IDF of “post” is the highest because it appears in the fewest documents (60). This makes “post” more valuable than the other terms.
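The IDF arithmetic above can be checked with a few lines of Python. This is only a sketch: the names `doc_freq`, `idf`, and `idf_values` are chosen for illustration, and the document counts are the ones used in the calculations above.

```python
import math

N = 10_000  # total number of documents in the collection

# Number of documents containing each term, as used in the calculations above.
doc_freq = {"new": 500, "york": 300, "times": 600,
            "los": 200, "angeles": 400, "post": 60}

def idf(term: str) -> float:
    """IDF = log10(N / number of documents the term appears in)."""
    return math.log10(N / doc_freq[term])

idf_values = {term: round(idf(term), 3) for term in doc_freq}
print(idf_values)

# The term found in the fewest documents gets the highest IDF.
rarest = max(idf_values, key=idf_values.get)
print(rarest)
```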
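Each term occurs exactly once in its title, so TF = 1 + log10(1) = 1 and every tf-idf weight equals the term's IDF. The RMS value listed for each document is the square root of the sum of its squared weights, i.e. the length of its tf-idf vector.
|d1||New = 1.301||York = 1.523||Times = 1.222||RMS = 2.346|
|d2||New = 1.301||York = 1.523||Post = 2.222||RMS = 2.9915|
|d3||Los = 1.699||Angeles = 1.398||Times = 1.222||RMS = 2.5168|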
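The three RMS values above agree with the Euclidean lengths of the tf-idf vectors of d1, d2, and d3, which the following sketch recomputes. Lowercasing the titles and splitting on whitespace are assumptions made for this illustration, and the names `tf_idf_vector` and `vector_length` are hypothetical. Because the code uses unrounded IDF values, its results can differ from the figures above in the fourth decimal place.

```python
import math

N = 10_000  # total number of documents in the collection

# Document frequencies taken from the IDF calculations in this post.
doc_freq = {"new": 500, "york": 300, "times": 600,
            "los": 200, "angeles": 400, "post": 60}

titles = {"d1": "New York Times", "d2": "New York Post", "d3": "Los Angeles Times"}

def tf_idf_vector(title):
    """Map each term in the title to (1 + log10(count)) * log10(N / df)."""
    tokens = title.lower().split()  # assumption: simple whitespace tokenization
    vector = {}
    for term in set(tokens):
        tf = 1 + math.log10(tokens.count(term))
        idf = math.log10(N / doc_freq[term])
        vector[term] = tf * idf
    return vector

def vector_length(vector):
    """Euclidean length: square root of the sum of squared weights."""
    return math.sqrt(sum(w * w for w in vector.values()))

vectors = {name: tf_idf_vector(title) for name, title in titles.items()}
lengths = {name: vector_length(vec) for name, vec in vectors.items()}

for name in titles:
    weights = {t: round(w, 3) for t, w in sorted(vectors[name].items())}
    print(name, weights, round(lengths[name], 3))
```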
We will follow up on this in our next article, where we will discuss how to calculate similarities between queries and documents, and between pairs of documents.