I have some documents and I'd like to find the k documents most similar to a selected document. For the sake of a reproducible example, let's say k is 1 and my documents are these:
documents = ['Two roads diverged in a yellow wood,',
'And sorry I could not travel both',
'And be one traveler, long I stood',
'And looked down one as far as I could',
'To where it bent in the undergrowth']
Then I think what I want to do is the below. (I'm using CountVectorizer for transparency and simplicity, even though maybe later I'd want to use Tf-Idf and a hashing vectorizer.)
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
vectorizer = CountVectorizer(analyzer='word')
ft = vectorizer.fit_transform(documents)
one_doc = documents[1]
one_doc_code = vectorizer.transform([one_doc])
doc_match = np.matrix(ft) * np.matrix(one_doc_code.transpose())
and now doc_match is a column vector with weights that indicate closeness of match (0 = no words in common, higher = better match; with raw counts the scores are not normalized to 1). But in order to do the multiplication, I (in desperation, in the face of element-wise multiplication) converted to a numpy matrix, so now I have this CSR format matrix that doesn't have a todense() member (so I can't just look, not that that would scale beyond my tiny example).
What I think I want now (but haven't been able to figure out so far) is how to say "what are the indices of the top k elements of doc_match?" (even if k is not 1).
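If all you want are the indices in doc_match that have the highest score, skip the np.matrix conversion: multiply the sparse matrices directly with dot, flatten the result to a 1-D array, and sort it in descending order:
scores = ft.dot(one_doc_code.T).toarray().ravel()  # dense 1-D array, one score per document
sorted_indices = np.argsort(scores)[::-1]  # document indices, best match first
top_k_indices = sorted_indices[:k]
doc_match_vals_sorted = scores[sorted_indices]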
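For larger collections, a full sort is more work than needed when only the top k indices matter. The following is a minimal sketch, not a definitive implementation: it reuses the question's documents, keeps the product sparse with the SciPy dot method, and swaps the full argsort for np.argpartition. The helper name top_k_matches is hypothetical and introduced here only for illustration, and the sketch assumes k is at least 1.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

documents = ['Two roads diverged in a yellow wood,',
             'And sorry I could not travel both',
             'And be one traveler, long I stood',
             'And looked down one as far as I could',
             'To where it bent in the undergrowth']


def top_k_matches(doc_term, query_vec, k):
    """Hypothetical helper: indices and scores of the k best rows, best first."""
    # Sparse-by-sparse product of shape (n_docs, 1); only this small
    # column is converted to a dense 1-D array.
    scores = doc_term.dot(query_vec.T).toarray().ravel()
    k = min(k, scores.size)  # assumes k >= 1
    # argpartition moves the k largest scores into the last k slots
    # (in no particular order) without sorting the whole array.
    candidates = np.argpartition(scores, -k)[-k:]
    # Sort only those k candidates from highest to lowest score.
    order = np.argsort(scores[candidates])[::-1]
    return candidates[order], scores[candidates][order]


vectorizer = CountVectorizer(analyzer='word')
ft = vectorizer.fit_transform(documents)
one_doc_code = vectorizer.transform([documents[1]])

top_idx, top_scores = top_k_matches(ft, one_doc_code, k=2)
print(top_idx, top_scores)
```

Because the selected document is itself part of the corpus, it comes back as the best match. To get only other documents, ask for k + 1 results and drop the selected index.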