In our previous post, we discussed tf-idf vectors and how to calculate them. So far we have learned what term frequency and inverse document frequency are, how they affect a document's relevancy, and how to calculate them. Now let's learn how to calculate cosine similarities between queries and documents, and between documents and documents.
If you go through the previous post, you'll see that we calculated normalized tf-idf vectors. There's a reason for that. We'll come to it in a while.
Cosine Similarity
Say we drew the vectors for the words of one document in an xy-plane and compared them with the vectors of other documents. If the documents were exactly the same, the vectors would overlap completely. If the documents were similar, the vectors would point in nearly the same direction; the more dissimilar the documents, the further the vectors would diverge. The similarity can thus be approximated by the angle between the vectors: taking the cosine of that angle gives a similarity score between 0 and 1. The closer the value is to 1 (the smaller the angle), the more similar the documents.
We know,

$$\cos\theta = \frac{\vec{a} \cdot \vec{b}}{|\vec{a}|\,|\vec{b}|}$$
Going back to the normalized tf-idf vectors: the magnitude (length) of a vector is the square root of the sum of the squares of its components,

$$|\vec{a}| = \sqrt{\sum_i a_i^2}$$
Hence our normalized vectors are equivalent to

$$\frac{\vec{a}}{|\vec{a}|}$$

which always has a length of 1.
That means the cosine similarity between two documents $d_1$ and $d_2$ reduces to a plain dot product of their normalized vectors, since the denominator becomes 1:

$$\text{sim}(d_1, d_2) = \sum_{t} \hat{w}_{t,d_1} \cdot \hat{w}_{t,d_2}$$

i.e., for each term, multiply its normalized tf-idf weight in $d_1$ by its normalized tf-idf weight in $d_2$, and sum the products.
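Here's a minimal sketch of that in Python (plain lists of tf-idf weights stand in for document vectors; the function names are mine, not from the previous post):

```python
import math

def norm(vec):
    """Euclidean length: square root of the sum of squared components."""
    return math.sqrt(sum(w * w for w in vec))

def cosine_similarity(a, b):
    """cos(theta) = (a . b) / (|a| * |b|) for two term-weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm(a) * norm(b))

# If the vectors are *already* normalized (length 1), the denominator is 1,
# so the cosine reduces to a plain dot product.
```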
Continuing our example from the previous article, let's find the similarity between the document "Los Angeles Post" (say, d4) and every other document.
SIM[d4-d1] = 0.5433 * 0 + 0.4471 * 0 + 0.7106 * 0 = 0
SIM[d4-d2] = 0.5433 * 0 + 0.4471 * 0 + 0.7106 * 0.7428 = 0.5278
SIM[d4-d3] = 0.5433 * 0.6750 + 0.4471 * 0.5555 + 0.7106 * 0 = 0.6151
Since the score is highest with d3, we know that "Los Angeles Post" (d4) is most similar to "Los Angeles Times" (d3).
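To make the arithmetic concrete, here's a short Python sketch that reproduces the three scores above. It stores only the non-zero normalized weights for each document; which weight belongs to which term ("los", "angeles", "post") is my reading of the example's ordering, so treat that mapping as an assumption:

```python
# Normalized tf-idf weights from the worked example, keyed by term.
# Zero-weight terms are simply omitted from each document's dict.
docs = {
    "d1": {},                                                   # "New York Times"
    "d2": {"post": 0.7428},                                     # "New York Post"
    "d3": {"los": 0.6750, "angeles": 0.5555},                   # "Los Angeles Times"
    "d4": {"los": 0.5433, "angeles": 0.4471, "post": 0.7106},   # "Los Angeles Post"
}

def sim(a, b):
    """Dot product over the terms the two vectors share."""
    return sum(w * b[t] for t, w in a.items() if t in b)

for d in ("d1", "d2", "d3"):
    print(f"SIM[d4-{d}] = {sim(docs['d4'], docs[d]):.4f}")
# SIM[d4-d1] = 0.0000
# SIM[d4-d2] = 0.5278
# SIM[d4-d3] = 0.6151
```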
Note: Although "New York Post" (d2) has only one word in common with "Los Angeles Post", unlike "Los Angeles Times" (d3) which has two, it still managed to score very close to d3. As you've probably guessed, that's because the one common word, "post", carries a lot of weight. Remember that "post" appeared in the fewest documents, giving it the highest inverse document frequency (IDF)!
Similarly, the similarity between a query and a document can be calculated: treat the query as a tiny document of its own, build its normalized tf-idf vector, and take the dot product with each document's normalized vector.
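For example, reusing the docs mapping and sim helper from the sketch above, a hypothetical query "los angeles" (assuming, purely for illustration, equal tf-idf weight on both terms, so each normalizes to about 0.7071) would rank the documents like this:

```python
# Hypothetical query "los angeles": equal weight on both terms,
# so after normalization each component is 1/sqrt(2) ~= 0.7071.
query = {"los": 0.7071, "angeles": 0.7071}

for d in sorted(docs, key=lambda d: sim(query, docs[d]), reverse=True):
    print(d, round(sim(query, docs[d]), 4))
# d3 0.8701   <- "Los Angeles Times" ranks first
# d4 0.7003   <- "Los Angeles Post"
# d1 0.0
# d2 0.0
```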