Sklearn cosine_similarity between a tfidf vector and an array of tfidf vectors

I'm trying to get the cosine similarity between one text and each of the texts contained in an array.

I have been working with this code:

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

text1 = 'Hola me llamo Luis'
text2 = 'Ayer Juan se compró una casa'
text3 = 'Casiguagua está más gordo que un manatí'
text4 = 'Y encima le huelen los pies'
text5 = 'HOlA ME LLAMO PEPE'

tweets = [text1, text2, text3, text4]

vectorizer = TfidfVectorizer(max_features=10000)
vectorizer.fit(tweets)  # the vectorizer must be fitted before transform can be called

text1_vector = vectorizer.transform([text1])
text2_vector = vectorizer.transform([text2])
text3_vector = vectorizer.transform([text3])
text4_vector = vectorizer.transform([text4])
text5_vector = vectorizer.transform([text5])

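buffer = []
for vector in (text1_vector, text2_vector, text3_vector, text4_vector):
    buffer.append(vector)

# This call is where the problem shows up: buffer is a plain Python list of sparse matrices
similarity = cosine_similarity(text5_vector.reshape(1,-1), buffer)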

My vectors are SciPy sparse matrices (csr_matrix).

So I guess I will have to convert my buffer to a csr_matrix, but I don't know how to do this.

I have also tried initializing my buffer as an np.array([]) object, but then I can't manage to add the vectors to the buffer later. Any idea what I am doing wrong?

You can't append sparse rows to a NumPy array. What you can do is convert each vector to a dense array with toarray and stack the results with np.vstack:

buffer = np.vstack([text1_vector.toarray(),
                    text2_vector.toarray(),
                    text3_vector.toarray(),
                    text4_vector.toarray()])

similarity = cosine_similarity(text5_vector.toarray(), buffer)
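If converting to dense arrays is a concern (for example with a large vocabulary), the vectors can probably stay sparse, since cosine_similarity accepts SciPy sparse matrices directly. The sketch below is not part of the original answer; it assumes the same texts as in the question and shows two ways to build the comparison matrix: transforming the whole list in one call, or stacking individually transformed rows with scipy.sparse.vstack. The names query, corpus_matrix, and stacked are illustrative.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

tweets = [
    'Hola me llamo Luis',
    'Ayer Juan se compró una casa',
    'Casiguagua está más gordo que un manatí',
    'Y encima le huelen los pies',
]
query = 'HOlA ME LLAMO PEPE'  # illustrative name for the text to compare

vectorizer = TfidfVectorizer(max_features=10000)

# Option 1: fit and transform the whole corpus at once.
# The result is a single sparse matrix with one row per tweet.
corpus_matrix = vectorizer.fit_transform(tweets)

# Option 2: stack individually transformed sparse rows without densifying them.
rows = [vectorizer.transform([tweet]) for tweet in tweets]
stacked = vstack(rows)

# The query must be transformed with the same fitted vectorizer.
query_vector = vectorizer.transform([query])

# Both calls return a dense array of shape (1, number of tweets).
similarity = cosine_similarity(query_vector, corpus_matrix)
similarity_stacked = cosine_similarity(query_vector, stacked)

print(similarity)
```

Both options should give the same scores. The first is usually simpler when all texts are available up front, while the second fits the case where vectors are produced one at a time and collected in a list.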
