Load a part of GloVe vectors with gensim

Question

I have a word list like ['like','Python'] and I want to load the pre-trained GloVe word vectors for just these words. The GloVe file is too large to load in full; is there a fast way to do this?

What I tried

I iterated through each line of the file, checked whether its word is in the list, and kept its vector if so. But this method is a little slow.

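import pandas as pd

def readWordEmbeddingVector(Wrd):
    # Copy into a set: O(1) membership tests, and the caller's list is not mutated.
    wanted = set(Wrd)
    words = []
    with open('glove.twitter.27B/glove.twitter.27B.200d.txt', 'r', encoding='utf-8') as f:
        for line in f:
            vector = line.split()
            if vector and vector[0] in wanted:
                words.append(vector)
                wanted.remove(vector[0])
                if not wanted:  # stop reading once every requested word was found
                    break
    # Column 0 holds the word; the remaining columns are the vector components.
    words_vector = pd.DataFrame(words).set_index(0).astype('float')
    return words_vector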

I also tried the following, but it loaded the whole file instead of only the vectors I need:

gensim.models.keyedvectors.KeyedVectors.load_word2vec_format('word2vec.twitter.27B.200d.txt')

What I want

A method like gensim.models.keyedvectors.KeyedVectors.load_word2vec_format, but one that lets me pass the list of words to load.

Answer 1

There's no existing gensim support for filtering the words loaded via load_word2vec_format(). The closest is an optional limit parameter, which can be used to limit how many word-vectors are read (ignoring all subsequent vectors).

You could conceivably create your own routine to perform such filtering, using the source code for load_word2vec_format() as a model. As a practical matter, you might have to read the file twice: first, to find out exactly how many words in the file intersect with your set of words of interest (so you can allocate the right-sized array without trusting the declared size at the front of the file), then a second time to actually read the words of interest.
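
A minimal sketch of the two-pass idea, using only the standard library. It is a variation on the suggestion above, not gensim's own loader: instead of filling an array, it writes the matching lines to a smaller text file with a word2vec-style header. The function name filter_vectors and both path arguments are hypothetical, and the sketch assumes a space-separated text file with one word per line followed by its vector.

```python
def filter_vectors(src_path, dst_path, wanted_words):
    """Copy only the vectors for wanted_words into a word2vec-format text file.

    Hypothetical helper for illustration; returns (count, dim).
    """
    wanted = set(wanted_words)  # set gives O(1) membership tests

    # Pass 1: count the matching lines and record the vector dimensionality,
    # because the word2vec text header must state both before any vector.
    # Lines with fewer than 3 fields (such as an existing "count dim" header)
    # are skipped.
    count, dim = 0, 0
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            parts = line.rstrip().split(" ")
            if len(parts) > 2 and parts[0] in wanted:
                count += 1
                dim = len(parts) - 1

    # Pass 2: write the header, then copy the matching lines unchanged.
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write(f"{count} {dim}\n")
        for line in src:
            parts = line.rstrip().split(" ")
            if len(parts) > 2 and parts[0] in wanted:
                dst.write(line if line.endswith("\n") else line + "\n")
    return count, dim
```

The resulting file contains only the words of interest, so it should be small enough to pass to KeyedVectors.load_word2vec_format() in full. Counting in a first pass keeps memory use flat regardless of how many words match, at the cost of reading the large file twice.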

Answer 1 (accepted), 2019-04-21 01:31:11
