Normalize cosine similarity values calculated based on tf-idf

Question

I compute cosine similarity from a tf-idf matrix:

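from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# tokenize_and_stem is my own tokenizer function (defined elsewhere)
tfidf_vectorizer_desc = TfidfVectorizer(min_df=5, max_df=0.8, use_idf=True, smooth_idf=True, sublinear_tf=False, tokenizer=tokenize_and_stem)
%time tfidf_matrix_desc = tfidf_vectorizer_desc.fit_transform(descriptions)  # fit the vectorizer to the text
sim_desc = cosine_similarity(tfidf_matrix_desc)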

However, sim_desc contains similarities greater than 1.0 (see below). As far as I know, cosine_similarity returns values between 0 and 1. In this case, do I need to normalize the cosine similarity scores?

import numpy as np

sim_desc = cosine_similarity(tfidf_matrix_desc)
print(np.where(sim_desc < 0))  # indices of entries below 0
print(np.where(sim_desc > 1))  # indices of entries above 1
print(format(np.amax(sim_desc), '.20g'), format(np.amin(sim_desc), '.20g'))

(array([], dtype=int64), array([], dtype=int64))
(array([   0,    0,    0, ..., 1496, 1496, 1497]), array([   0,    1,  735, ..., 1495, 1496, 1497]))
1.0000000000000006661 0

Answer 1

You haven't specified which library you are using, so I can't say whether you need to normalize the cosine similarity scores.

However, here is a relevant fact:

In general, cosine similarity returns values between -1 and +1. If two vectors point in exactly opposite directions (180 degrees apart), the cosine similarity is -1.

Reference: http://blog.christianperone.com/2013/09/machine-learning-cosine-similarity-for-vector-space-models-part-iii/
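
To make the range concrete, here is a minimal sketch in plain Python. The helper names cosine and clamp are made up for this illustration and are not part of any library. It shows that opposite vectors give -1 and that non-negative vectors give a value between 0 and 1. Tf-idf weights are non-negative, so the question's similarities fall in that second case. The maximum reported in the question (1.0000000000000006661) exceeds 1 by only about 7e-16, which looks like floating-point rounding rather than a scaling problem. If so, clamping is probably enough and no renormalization is needed.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def clamp(x, lo, hi):
    """Clamp x into [lo, hi] to remove floating-point overshoot."""
    return max(lo, min(hi, x))

# Opposite directions: the cosine is -1 (up to rounding).
opposite = cosine([1.0, 2.0, 3.0], [-1.0, -2.0, -3.0])

# Non-negative vectors (as tf-idf weights are): the cosine lies in [0, 1].
nonneg = cosine([1.0, 0.0, 2.0], [0.0, 3.0, 1.0])

# The maximum reported in the question is above 1 only by rounding error.
reported_max = 1.0000000000000006661
clamped_max = clamp(reported_max, 0.0, 1.0)

print(opposite, nonneg, clamped_max)
```

On a NumPy array such as sim_desc, np.clip(sim_desc, 0.0, 1.0) applies the same clamping to every entry at once.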

solution1
0 2017-03-14 11:28:58