Find cosine distance for all pairs of word2vec encodings without using nested loops

Question

I need to calculate and store the cosine distance for every pair of word2vec encodings. Each word is represented as a 4 × 1 vector stored in a pandas dataframe, with each element in the continuous range [1, 9]. I need to store the result in a pandas dataframe so that any distance can be looked up in constant time.

I have not been able to do this with the pandas apply function and a lambda. Using nested loops would take approximately 9 hours (according to tqdm). The result should be laid out like this:

word     word1    word2    word3 ...
word1    d11      d12      d13...
word2    d21      d22      d23...
word3    d31      d32      d33...
.
.
.

Answer 1

If you use a library such as Python's gensim to load a pre-existing vector set (in the original word2vec.c format) into its KeyedVectors representation, the raw vectors are available as a numpy array in its vectors property. For example:

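from gensim.models import KeyedVectors
# Load vectors saved in the original word2vec.c binary format
kv = KeyedVectors.load_word2vec_format('word_vectors.bin', binary=True)
print(kv.vectors.shape)  # (number of words, vector dimensions)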

You could then use a library function like scikit-learn's pairwise_distances() to compute the distance matrix:

from sklearn.metrics import pairwise_distances
distances = pairwise_distances(kv.vectors, metric="cosine")

Because the sklearn routine uses optimized native math routines, it will likely be a lot faster than your initial loops-in-pure-Python approach. Note, though, that the resulting distances matrix may be huge: for N words it has N × N entries.
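To get a feel for how large the matrix can become, here is a rough back-of-the-envelope sketch. The helper name distance_matrix_bytes is hypothetical, and the default of 8 bytes per entry assumes float64 values; a float32 matrix would need half as much.

```python
def distance_matrix_bytes(n_words: int, bytes_per_value: int = 8) -> int:
    """Bytes needed for a dense n_words x n_words matrix.

    bytes_per_value=8 assumes float64 entries (an assumption; use 4 for float32).
    """
    return n_words * n_words * bytes_per_value


# Print the estimated size for a few vocabulary sizes.
for n in (10_000, 100_000, 1_000_000):
    gib = distance_matrix_bytes(n) / 2**30
    print(f"{n:>9,} words -> {gib:,.1f} GiB")
```

If the estimate exceeds available memory, computing distances only for the rows you actually need is one way around it.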

(You can find out which words are in which kv.vectors slots via the list in kv.index2entity, or look up the slot for a word via the dict in kv.vocab. In gensim 4.0 and later these are named kv.index_to_key and kv.key_to_index.)
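Since the question asks for the result in a pandas dataframe keyed by word, the distance matrix can be wrapped with word labels on both axes. The following is a minimal sketch using made-up toy data: the words list and vectors array are hypothetical stand-ins for the word list and kv.vectors (or for the values of your own dataframe).

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical toy data: three words, each a 4-dimensional vector.
words = ["word1", "word2", "word3"]
vectors = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same direction as word1, so distance is ~0
    [9.0, 1.0, 1.0, 1.0],
])

# One vectorized call computes all pairwise cosine distances.
distances = pairwise_distances(vectors, metric="cosine")

# Label rows and columns with the words so lookups are by name.
dist_df = pd.DataFrame(distances, index=words, columns=words)

print(dist_df.round(3))
print(dist_df.at["word1", "word3"])  # single-cell lookup by label
```

Label-based lookups with .at go through the dataframe's hash-based index, so a single cell can be fetched without scanning the table.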

Answer 1: accepted, 2018-10-06 00:51:18
