
Find cosine distance for all pairs of word2vec encodings without using nested loops

I need to calculate and store the cosine distance for all pairs of word2vec encodings. Each word is represented as a 4 * 1 vector stored in a pandas dataframe, with each element in the continuous range [1, 9]. I need to store the result in a pandas dataframe so that it can be accessed in constant time.

I am unable to use the pandas apply function with a lambda for this. Using nested loops would take approximately 9 hours (according to tqdm).

word     word1    word2    word3 ...
word1    d11      d12      d13...
word2    d21      d22      d23...
word3    d31      d32      d33...
.
.
.

If you use something like the Python gensim library to load a pre-existing vector set (in the original word2vec.c format) into its KeyedVectors representation, the raw vectors will be in a numpy array in its vectors property. For example:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format('word_vectors.bin', binary=True)
print(kv.vectors.shape)  # (number of words, vector dimensions)

You could then use a library function like scikit-learn's pairwise_distances() to compute the distance matrix:

from sklearn.metrics import pairwise_distances
distances = pairwise_distances(kv.vectors, metric="cosine")

Because the sklearn routine uses optimized native math routines, it will likely be a lot faster than your initial loops-in-pure-Python approach. Note, though, that the resulting distances matrix may be huge: it holds one entry for every pair of words, so its size grows with the square of the vocabulary size.
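To make the idea concrete, here is a minimal numpy-only sketch of the same computation: normalize each row to unit length, take one matrix product to get all cosine similarities, and subtract from 1. The function name cosine_distance_matrix and the toy data are hypothetical and used only for illustration. This is not how scikit-learn implements pairwise_distances internally, and it assumes no row is all zeros.

```python
import numpy as np

def cosine_distance_matrix(vectors):
    """Return the (n, n) matrix of cosine distances between the rows of `vectors`.

    Illustrative helper (not part of any library); assumes no all-zero rows.
    """
    vectors = np.asarray(vectors, dtype=np.float64)
    # Scale every row to unit length.
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    unit = vectors / norms
    # For unit vectors, the dot product equals the cosine similarity,
    # so one matrix product covers every pair of rows at once.
    similarity = unit @ unit.T
    # Clip to guard against rounding slightly outside [-1, 1].
    return 1.0 - np.clip(similarity, -1.0, 1.0)

# Hypothetical toy data: 5 words, 4 components each, values in [1, 9].
rng = np.random.default_rng(0)
toy_vectors = rng.uniform(1.0, 9.0, size=(5, 4))
toy_distances = cosine_distance_matrix(toy_vectors)

# Rough memory estimate for a full float64 matrix over n_words words.
n_words = 100_000
matrix_gigabytes = n_words ** 2 * 8 / 1e9
```

For a vocabulary of 100,000 words, the full float64 matrix in this estimate comes to about 80 GB, so it is worth checking the vocabulary size before materializing every pair.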

(You can find out which words are in which kv.vectors slots via the list in kv.index2entity, or look up the slot for a word via the dict in kv.vocab. In gensim 4.0 and later, these are named kv.index_to_key and kv.key_to_index.)
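Since the question asks for the result in a pandas dataframe, one possible way to finish is to label the distance matrix with the words on both axes. The sketch below uses a small hypothetical word list and made-up vectors in place of real data. With gensim, the word list mentioned above would supply the labels, in the same order as the rows of kv.vectors.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical stand-in for the real data: one row per word,
# four components per word, values in [1, 9].
words = ["word1", "word2", "word3"]
vectors = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],  # same direction as word1, so distance is ~0
    [9.0, 1.0, 1.0, 1.0],
])

# All pairwise cosine distances in one call, with no Python-level loops.
distances = pairwise_distances(vectors, metric="cosine")

# Label rows and columns with the words, in the same order as the vectors.
distance_df = pd.DataFrame(distances, index=words, columns=words)

# Label-based scalar lookup of a single pair.
d = distance_df.at["word1", "word3"]
```

Lookups through .at go through the hash-based index of the row and column labels, which should behave as constant time on average, provided the words are unique.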
