I'm trying to write a Python method to efficiently return the n closest words to a given word, based on their respective embedding vectors. Each vector is 200 dimensions, and there are a couple million of them.
Here's what I have at the moment, which simply does a cosine similarity comparison between the target word and every other word. This is very, very slow:
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def n_nearest_words(word, n, word_vectors):
    """
    Return a list of the n nearest words to param word, based on cosine similarity
    param word_vectors: dict, keys are words and values are vectors
    """
    # get_word_vector() finds the word in the word_vectors dict, trying a number of
    # possible capitalizations. Returns None if not found
    word_vector = get_word_vector(word, word_vectors)
    if word_vector is not None:  # truth-testing a NumPy array raises a ValueError
        word_vector = word_vector.reshape((1, -1))
        sorted_by_sim = sorted(
            word_vectors.keys(),
            key=lambda other_word: cosine_similarity(
                word_vector, word_vectors[other_word].reshape((1, -1)))[0, 0],
            reverse=True)
        return sorted_by_sim[1:n + 1]  # skip the first item, which should be the target word itself
    return []
Does anybody have any better suggestions?
Perhaps try storing the distance between each pair of words in a dict, so that once you have computed a similarity you can look it up instead of recomputing it.