
Finding k most similar documents

I have some documents and I'd like to find the k documents most similar to a selected document. For the sake of a reproducible example, let's say k is 1 and my documents are these:

documents = ['Two roads diverged in a yellow wood,',
             'And sorry I could not travel both',
             'And be one traveler, long I stood',
             'And looked down one as far as I could',
             'To where it bent in the undergrowth']

Then I think what I want to do is the following. (I'm using CountVectorizer for transparency and simplicity, even though later I might want to use Tf-Idf and a hashing vectorizer.)

from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

# Learn the vocabulary and build the document-term count matrix (sparse CSR)
vectorizer = CountVectorizer(analyzer='word')
ft = vectorizer.fit_transform(documents)

# Encode the selected document against the same vocabulary
one_doc = documents[1]
one_doc_code = vectorizer.transform([one_doc])

# Dot every document's count vector with the selected document's vector
doc_match = np.matrix(ft) * np.matrix(one_doc_code.transpose())

And now doc_match is a column vector of scores indicating closeness of match (higher means more similar; note that these raw count overlaps are not normalized to the range 0 to 1). But in order to do the multiplication at all (in desperation, in the face of element-wise multiplication), I converted to a numpy matrix, so now I have this CSR-format matrix that doesn't expose a todense() member (so I can't even just look at the result, not that that would scale beyond my tiny example).
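As an aside, the np.matrix conversion can apparently be sidestepped entirely: scipy sparse matrices support the matrix product directly (the @ operator), and scikit-learn's sklearn.metrics.pairwise.cosine_similarity returns scores that really are normalized. A minimal sketch, reusing ft and one_doc_code from above:

from sklearn.metrics.pairwise import cosine_similarity

# Sparse times sparse is already a matrix product; no conversion needed.
# The result is a sparse column vector of raw term-overlap counts.
doc_match = ft @ one_doc_code.T
print(doc_match.toarray().ravel())

# cosine_similarity normalizes the counts, so scores do land in [0, 1];
# TfidfVectorizer would be a drop-in replacement for CountVectorizer here.
print(cosine_similarity(ft, one_doc_code).ravel())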

What I think I want now (but haven't been able to figure out so far) is how to ask, "what are the indices of the top k elements of doc_match?", even when k is greater than 1.

If all you want are the indices of the highest-scoring elements in doc_match, you can do:

scores = np.asarray(doc_match).ravel()      # flatten to a plain 1-D array
sorted_indices = np.argsort(scores)[::-1]   # argsort is ascending, so reverse
doc_match_vals_sorted = scores[sorted_indices]

The first k entries of sorted_indices are then the indices of the k best matches.
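For a large collection, a full sort does more work than necessary: np.argpartition selects the k largest scores in linear time, and you only sort those k afterwards. A sketch along those lines, reusing scores from above with an illustrative k of 2:

k = 2  # illustrative value

# argpartition moves the k largest scores into the last k positions in O(n);
# a final argsort over just those k entries orders them best-first.
top_k = np.argpartition(scores, -k)[-k:]
top_k = top_k[np.argsort(scores[top_k])[::-1]]
print(top_k)

Note that the top result will usually be the selected document itself (it matches itself perfectly), so in practice you may want to ask for k + 1 indices and drop the self-match.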
