
Gensim and Annoy for finding similar sentences

I have a large number of sentences in a database and I want to find the most similar of those sentences to a single sentence that the user types in.

It looks like I may be able to do this with annoy and gensim, but all the examples I can find use word2vec, which I believe is good for finding single similar words but not sentences. However, I note that AnnoyIndexer() can take either a word2vec or a doc2vec model.

Am I correct that the process is the same, but swapping the word2vec model for a doc2vec model and using a doc2vec vector of the search sentence?

Do I need to use pre-trained word embeddings in any way, or do I literally just train the doc2vec model with the corpus of sentences that I have in my database?

Thank you!

Doc2Vec does not require any pre-trained word-vectors: you just train it on your corpus, and it learns what it needs.
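For example, a minimal sketch of that training step (assuming gensim 4.x; the `sentences` list is just a placeholder for whatever you load from your database):

```python
# Train Doc2Vec directly on your own sentences -- no pre-trained vectors needed.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

sentences = [
    "the cat sat on the mat",
    "a dog chased the ball in the park",
    "the kitten rested on the rug",
]

# Each sentence becomes a TaggedDocument with a unique tag.
corpus = [TaggedDocument(words=s.lower().split(), tags=[i])
          for i, s in enumerate(sentences)]

model = Doc2Vec(corpus, vector_size=100, min_count=1, epochs=40)

# Infer a vector for the user's query sentence, then rank the stored sentences.
query_vec = model.infer_vector("a cat lay on a rug".split())
print(model.dv.most_similar([query_vec], topn=3))
```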

For comparing sentences, you could also try calculating a per-sentence vector that's the sum or average of all its words' word-vectors.
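A rough sketch of that averaging approach, using a hypothetical sentence_vector() helper over a small Word2Vec model trained only for illustration (any source of word-vectors would do):

```python
# Per-sentence vector as the average of its words' word-vectors.
import numpy as np
from gensim.models import Word2Vec

train = [s.split() for s in ["the cat sat on the mat",
                             "a dog chased the ball"]]
w2v = Word2Vec(train, vector_size=100, min_count=1, epochs=20)

def sentence_vector(words, kv):
    """Average the vectors of the words the model knows; zeros if none are known."""
    vecs = [kv[w] for w in words if w in kv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(kv.vector_size)

v1 = sentence_vector("the cat sat".split(), w2v.wv)
v2 = sentence_vector("a dog ran".split(), w2v.wv)

# Cosine similarity between the two sentence vectors.
print(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```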

If the sentences aren't too long, you could also consider "Word Mover's Distance", available from gensim word-vectors as .wmdistance(word_list, word_list). (It's far more expensive to calculate these pairwise distances than the simple similarity between 2 fixed-length vectors – but it may capture human perception of similarity better.)
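A small sketch of wmdistance() on two tokenized sentences (depending on your gensim version this also needs the pyemd or POT package installed; the tiny training corpus is again just an illustration):

```python
# Word Mover's Distance between two tokenized sentences; lower means more similar.
from gensim.models import Word2Vec

train = [s.split() for s in ["the cat sat on the mat",
                             "a dog chased the ball in the park"]]
w2v = Word2Vec(train, vector_size=50, min_count=1, epochs=20)

d = w2v.wv.wmdistance("the cat sat".split(), "a dog sat on the mat".split())
print(d)
```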

Note that ANNOY is just an indexing optimization, which gains speed at the cost of precision. It's only necessary if the brute-force way of finding the .most_similar() results – calculating all similarities then sorting to find the top-N – is too slow. It will use more memory for indexing, and accept a risk of sometimes not finding the exact true nearest-neighbors, in exchange for speed.
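If you do reach that point, one way to wire it in might look like the sketch below (assuming gensim 4.x with the annoy package installed, and reusing the Doc2Vec `model` from the earlier sketch; num_trees=50 is an arbitrary choice):

```python
# Approximate nearest-neighbour search over the stored sentence (doc) vectors.
from gensim.similarities.annoy import AnnoyIndexer

indexer = AnnoyIndexer(model, num_trees=50)  # more trees: better accuracy, more memory

query_vec = model.infer_vector("a cat lay on a rug".split())
print(model.dv.most_similar([query_vec], topn=3, indexer=indexer))
```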
