
Cosine Similarity and TS-SS similarity among documents using tf-idf - Python

A common way of calculating the cosine similarity between text-based documents is to compute the tf-idf matrix and then take the linear kernel of that matrix.

The tf-idf matrix is calculated using TfidfVectorizer():

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix_content = tfidf.fit_transform(article_master['stemmed_content'])

Here article_master is a dataframe containing the text content of all the documents.
As explained by Chris Clark here, TfidfVectorizer produces normalised vectors; hence the linear_kernel results can be used as the cosine similarity.

from sklearn.metrics.pairwise import linear_kernel
cosine_sim_content = linear_kernel(tfidf_matrix_content, tfidf_matrix_content)


This is where my confusion lies.

Effectively the cosine similarity between 2 vectors is:

InnerProduct(vec1,vec2) / (VectorSize(vec1) * VectorSize(vec2))

The linear kernel calculates only the inner product, as stated here:

k(x, y) = x^T y

So the questions are:

  1. Why am I not dividing the inner product by the product of the magnitudes of the vectors?

  2. Why does the normalisation exempt me from this requirement?

  3. Now if I wanted to calculate ts-ss similarity, could I still use the normalised tf-idf matrix and the cosine values (calculated by linear kernel only)?

Thanks to @timleathart's answer here, I finally know the reason.

Normalised vectors have magnitude 1, so it doesn't matter if you explicitly divide by the magnitudes or not. It's mathematically equivalent either way.

The tf-idf vectoriser normalises the individual rows (vectors) so that they are all of length 1. Since cosine similarity is only concerned with the angle, the magnitude difference of the vectors does not matter.
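
The equivalence is easy to check on a toy corpus. The following is a minimal sketch, assuming scikit-learn and NumPy are installed; the three sample sentences and the variable names are made up for illustration and are not from the original article_master dataframe. It relies on TfidfVectorizer's default norm='l2' setting.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

# Hypothetical toy corpus standing in for article_master['stemmed_content'].
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs are common pets",
]

# TfidfVectorizer L2-normalises each row by default (norm='l2').
tfidf_matrix = TfidfVectorizer().fit_transform(docs)

# Length of every document vector; densifying is fine for a toy corpus.
row_norms = np.linalg.norm(tfidf_matrix.toarray(), axis=1)

# Plain inner products versus the explicitly normalised cosine similarity.
dot_products = linear_kernel(tfidf_matrix, tfidf_matrix)
cosines = cosine_similarity(tfidf_matrix, tfidf_matrix)

print(np.allclose(row_norms, 1.0))
print(np.allclose(dot_products, cosines))
```

Both checks should print True: every row has length 1, so dividing by the product of the magnitudes is a division by 1. If the vectoriser were created with norm=None, the rows would no longer be unit length and linear_kernel would stop matching cosine_similarity.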

The main reason for using ts-ss is to take into account both the angle and the difference in magnitude between the vectors. So although there is nothing technically wrong with using normalised vectors, doing so defeats the whole purpose of the Triangle Similarity component.
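
To make the effect of normalisation concrete, here is a rough sketch in plain Python. It assumes the commonly cited formulation of ts-ss: a triangle area |A||B|sin(θ')/2 multiplied by a sector area π(ED + MD)²θ'/360, with θ' equal to the angle between the vectors plus 10 degrees, ED the Euclidean distance and MD the magnitude difference. The function names and the example vectors are illustrative, and the formula should be checked against the ts-ss paper before relying on it.

```python
import math


def norm(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def l2_normalise(v):
    """Scale a vector to length 1, as tf-idf L2 normalisation does per row."""
    n = norm(v)
    return [x / n for x in v]


def ts_ss(a, b):
    """TS-SS under the assumed formulation described in the lead-in."""
    dot = sum(x * y for x, y in zip(a, b))
    # Clamp to [-1, 1] to guard acos against floating-point overshoot.
    cos = max(-1.0, min(1.0, dot / (norm(a) * norm(b))))
    theta_deg = math.degrees(math.acos(cos)) + 10.0
    triangle = norm(a) * norm(b) * math.sin(math.radians(theta_deg)) / 2.0
    euclidean = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    magnitude_diff = abs(norm(a) - norm(b))
    sector = math.pi * (euclidean + magnitude_diff) ** 2 * theta_deg / 360.0
    return triangle * sector


# Two vectors pointing the same way, one twice as long as the other.
a = [1.0, 2.0, 0.0]
b = [2.0, 4.0, 0.0]

raw_score = ts_ss(a, b)
unit_score = ts_ss(l2_normalise(a), l2_normalise(b))

print(raw_score, unit_score)
```

On the raw vectors the score is clearly positive because the magnitudes differ, whereas on the normalised vectors the Euclidean distance and the magnitude difference both collapse to (almost) zero, so the score does too. That is the information thrown away when ts-ss is fed a normalised tf-idf matrix.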
