
Python - Efficiently find n nearest vectors

I'm trying to write a Python method that efficiently returns the n closest words to a given word, based on their embedding vectors. Each vector has 200 dimensions, and there are a couple million of them.

Here's what I have at the moment, which simply does a cosine similarity comparison between the target word and every other word. It is very, very slow:

# assuming sklearn's pairwise cosine_similarity, given the (1, -1) reshapes
from sklearn.metrics.pairwise import cosine_similarity

def n_nearest_words(word, n, word_vectors):
    """
    Return a list of the n nearest words to param word, based on cosine similarity
    param word_vectors: dict, keys are words and values are vectors
    """
    # get_word_vector() finds the word in the word_vectors dict, using a number of
    # possible capitalizations. Returns None if not found
    word_vector = get_word_vector(word, word_vectors)
    if word_vector is not None:  # truth-testing a multi-element NumPy array raises ValueError
        word_vector = word_vector.reshape((1, -1))
        sorted_by_sim = sorted(
            word_vectors.keys(),
            # [0, 0] pulls the scalar out of the 1x1 array cosine_similarity returns
            key=lambda other_word: cosine_similarity(
                word_vector, word_vectors[other_word].reshape((1, -1)))[0, 0],
            reverse=True)
        return sorted_by_sim[1:n + 1]  # ignore first item, which should be the target word itself
    return []

Does anybody have any better suggestions?

Maybe try storing the distance between two words in a dict, so that once you have computed it for a pair you can look it up instead of recomputing it.
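
As a minimal sketch of that suggestion, assuming the vectors are 1-D NumPy arrays (the names similarity_cache, cosine_sim, and cached_similarity are illustrative, not from the post):

import numpy as np

similarity_cache = {}

def cosine_sim(vec_a, vec_b):
    # plain NumPy cosine similarity between two 1-D vectors
    return np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b))

def cached_similarity(word_a, word_b, word_vectors):
    # key on the sorted pair so (a, b) and (b, a) share one cache entry
    key = (word_a, word_b) if word_a <= word_b else (word_b, word_a)
    if key not in similarity_cache:
        similarity_cache[key] = cosine_sim(word_vectors[word_a], word_vectors[word_b])
    return similarity_cache[key]

Be aware that with a couple million words a cache of all pairs will not fit in memory, so this only pays off when the same pairs recur across queries; caching the finished result of n_nearest_words per query word is a more practical variant of the same idea.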
