I need to calculate and store the distance between every pair of words. Each word is represented as a 4 * 1 vector stored in a pandas dataframe, with each element in the continuous range [1, 9]. I need to store the result in a pandas dataframe so that any entry can be accessed in constant time.
I am unable to use pandas' apply function (with a lambda) for this. Using nested loops will take approx. 9 hours (according to tqdm).
The result should look like this:
word   word1  word2  word3  ...
word1  d11    d12    d13    ...
word2  d21    d22    d23    ...
word3  d31    d32    d33    ...
...
If you use something like the Python gensim library to load a pre-existing vector set (in the original word2vec.c format) into its KeyedVectors representation, the raw vectors will be in a numpy array in its vectors property. For example:
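from gensim.models import KeyedVectors
# load vectors saved in the binary word2vec.c format
kv = KeyedVectors.load_word2vec_format('word_vectors.bin', binary=True)
# (number of words, vector dimensionality)
print(kv.vectors.shape)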
You could then use a library function like scikit-learn's pairwise_distances() to compute the distance matrix:
from sklearn.metrics import pairwise_distances
distances = pairwise_distances(kv.vectors, metric="cosine")
Because the sklearn routine uses optimized native math routines, it will likely be a lot faster than your initial loops-in-pure-Python approach. Note, though, that the resulting distances matrix may be huge: it has one entry for every pair of words, so its size grows with the square of the vocabulary size.
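To get a feel for how large the matrix can get, here is a rough back-of-the-envelope sketch. It assumes a dense matrix of 8-byte floats (numpy's default float64); the helper name distance_matrix_bytes and the vocabulary sizes are made up for illustration. For example, 100,000 words would need 100,000 × 100,000 × 8 = 8 × 10^10 bytes, about 80 GB.

```python
def distance_matrix_bytes(n_words: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n_words x n_words matrix.

    itemsize=8 assumes float64 entries; pass 4 for float32.
    """
    return n_words * n_words * itemsize


# Hypothetical vocabulary sizes, just to show the quadratic growth.
for n in (10_000, 100_000, 1_000_000):
    gib = distance_matrix_bytes(n) / 2**30
    print(f"{n:>9,} words -> {gib:,.1f} GiB")
```

If the estimate exceeds the available RAM, storing the full matrix in a single dataframe is probably not practical, and computing distances on demand for the pairs you actually need may be the better trade-off.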
(You can find out which words are in which kv.vectors slots via the list in kv.index2entity, or look up the slot for a word via the dict in kv.vocab. In gensim 4.0 and later, these are named kv.index_to_key and kv.key_to_index, respectively.)
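If the vectors already live in a pandas dataframe, as in the question, the same idea might look roughly like the sketch below. The toy words, the random vectors, and the choice of the cosine metric are all assumptions for illustration; the question does not say which distance is wanted.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Hypothetical toy data: one row per word, four components each in [1, 9].
rng = np.random.default_rng(0)
words = ["word1", "word2", "word3", "word4", "word5"]
vectors = pd.DataFrame(rng.uniform(1, 9, size=(len(words), 4)), index=words)

# Compute all pairwise distances in one vectorized call (no Python loops).
# The metric is an assumption; swap in "euclidean" or another as needed.
dist = pairwise_distances(vectors.to_numpy(), metric="cosine")

# Label rows and columns with the words to get the word-by-word table.
distances = pd.DataFrame(dist, index=words, columns=words)

# Scalar lookup of a single entry by its row and column labels.
d = distances.at["word1", "word3"]
print(d)
```

Wrapping the array in a dataframe indexed by word on both axes gives the layout asked for in the question, and .at is pandas' accessor for fetching a single value by label, so no scan over the table should be needed.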