
Gensim and Annoy for finding similar sentences

I have a large number of sentences in a database and I want to find the most similar of those sentences to a single sentence that the user types in.

It looks like I may be able to do this with annoy and gensim, but all the examples I can find use word2vec, which I believe is good for finding single similar words but not sentences. However, I note that AnnoyIndexer() can take either a word2vec or a doc2vec model.

Am I correct that the process is the same, but swapping the word2vec model for a doc2vec model and using a doc2vec vector of the search sentence?

Do I need to use pre-trained word embeddings in any way, or do I literally just train the doc2vec model with the corpus of sentences that I have in my database?

Thank you!

Doc2Vec does not require any pre-trained word-vectors: you just train it on your corpus, and it learns what it needs.
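For example, a minimal sketch of that training step (assuming gensim 4.x; the `sentences` list is just a placeholder for whatever you load from your database):

```python
# Train Doc2Vec directly on your own sentences -- no pre-trained vectors needed.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    "the cat sat on the mat",
    "a dog chased the ball in the park",
    "the kitten rested on the rug",
]

# Each sentence becomes a TaggedDocument with a unique tag.
corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
          for i, s in enumerate(sentences)]

model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# Infer a vector for the user's query sentence, then rank the stored sentences.
query_vec = model.infer_vector("a cat lay on a rug".split())
print(model.dv.most_similar([query_vec], topn=3))
```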

For comparing sentences, you could also try calculating a per-sentence vector that's the sum or average of all its words' word-vectors.
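A rough sketch of that averaging approach, using a hypothetical sentence_vector() helper over a small Word2Vec model trained only for illustration (any source of word-vectors would do):

```python
# Per-sentence vector as the average of its words' word-vectors.
import numpy as np
from gensim.models import Word2Vec

train = [s.split() for s in ["the cat sat on the mat",
                             "a dog chased the ball"]]
w2v = Word2Vec(train, vector_size=100, min_count=1, epochs=20)

def sentence_vector(words, kv):
    """Average the vectors of the words the model knows; zeros if none are known."""
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

v1 = sentence_vector("the cat sat".split(), w2v.wv)
v2 = sentence_vector("a dog ran".split(), w2v.wv)

# Cosine similarity between the two sentence vectors.
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```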

If the sentences aren't too long, you could also consider "Word Mover's Distance", available from gensim word-vectors as .wmdistance(word_list, word_list). (It's far more expensive to calculate these pairwise distances than the simple similarity between 2 fixed-length vectors – but it may capture human perception of similarity better.)
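A small sketch of wmdistance() on two tokenized sentences (depending on your gensim version this also needs the pyemd or POT package installed; the tiny training corpus is again just an illustration):

```python
# Word Mover's Distance between two tokenized sentences; lower means more similar.
from gensim.models import Word2Vec

train = [s.split() for s in ["the cat sat on the mat",
                             "a dog chased the ball in the park"]]
w2v = Word2Vec(train, vector_size=50, min_count=1, epochs=20)

d = w2v.wv.wmdistance("the cat sat".split(), "a dog sat on the mat".split())
print(d)
```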

Note that ANNOY is just an indexing optimization, which gains speed at the cost of precision. It's only necessary if the brute-force way of finding the .most_similar() results – calculating all similarities then sorting to find the top-N – is too slow. It will use more memory for indexing, and accept a risk of sometimes not finding the exact true nearest-neighbors, in exchange for speed.
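If you do reach that point, one way to wire it in might look like the sketch below (assuming gensim 4.x with the annoy package installed, and reusing the Doc2Vec `model` from the earlier sketch; num_trees=50 is an arbitrary choice):

```python
# Approximate nearest-neighbour search over the stored sentence (doc) vectors.
from gensim.similarities.annoy import AnnoyIndexer

indexer = AnnoyIndexer(model, num_trees=50)  # more trees: better accuracy, more memory

query_vec = model.infer_vector("a cat lay on a rug".split())
print(model.dv.most_similar([query_vec], topn=3, indexer=indexer))
```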
