
How to use Wiki: Fasttext.vec and Google News: Word2vec.bin pre-trained files as weights for a Keras Embedding layer

I have a function to extract the pre-trained embeddings from GloVe.txt and load them as Keras Embedding layer weights, but how can I do the same for the given two files?

This accepted stackoverflow answer gave me the feeling that .vec can be treated as .txt, and that we might use the same technique to extract fasttext.vec that we use for glove.txt. Is my understanding correct?
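
One way to check this is to peek at the first lines of the .vec file: it seems to start with a header line holding the vocabulary size and vector dimension, while the remaining lines have the same "word v1 v2 ... v300" layout as GloVe (a sketch; the file name is just an example):

    # peek at the first two lines of a FastText .vec file
    with open('wiki-news-300d-1M.vec', encoding='utf8', errors='ignore') as f:
        header = f.readline().split()   # e.g. ['999994', '300'] -> vocab size and dimension
        first = f.readline().split()    # from here on: '<word> <300 floats>', same layout as GloVe
    print(header, first[0], len(first) - 1)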

I went through a lot of blogs and Stack Overflow answers to find out what to do with the binary file. I found in this stack answer that the binary .bin file is the MODEL itself, not the embeddings, and that you can convert the bin file to a text file using Gensim. I think that saves the embeddings, so we can then load the pre-trained embeddings just like we load GloVe. Is my understanding correct?

Here is the code to do that. I want to know if I'm on the right path, because I could not find a satisfactory answer to my question anywhere.

    from tensorflow.keras.preprocessing.text import Tokenizer
    from tensorflow.keras.preprocessing.sequence import pad_sequences
    from gensim.models import KeyedVectors

    tokenizer = Tokenizer()                       # Keras Tokenizer()
    tokenizer.fit_on_texts(data)                  # data is a list of texts
    vocab_size = len(tokenizer.word_index) + 1    # extra 1 for unknown words
    encoded_docs = tokenizer.texts_to_sequences(data)
    padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')  # max_length is, say, 30

    model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)  # load the binary Word2Vec model
    model.save_word2vec_format('GoogleNews-vectors-negative300.txt', binary=False)  # save the VECTORS to a text file. Can it be loaded with the function below?
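
As an aside, it seems the text round-trip can be skipped by reading the vectors straight from the loaded KeyedVectors object (a sketch, reusing the model and tokenizer from above):

    import numpy as np

    # build the weight matrix directly from the loaded KeyedVectors, without writing a .txt file
    embedding_matrix = np.zeros((vocab_size, 300))             # 300 is the Google News vector dimension
    for word, i in tokenizer.word_index.items():
        if word in model:                                      # KeyedVectors supports membership tests
            embedding_matrix[i] = model[word]                  # look the vector up directly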


    from numpy import asarray, zeros

    def load_embeddings(vocab_size, fitted_tokenizer, emb_file_path, emb_dim=300):
        '''
        It can load GloVe.txt for sure. But is it the right way to load paragram.txt,
        fasttext.vec and word2vec.bin if converted to .txt?
        '''
        embeddings_index = dict()
        with open(emb_file_path, encoding='utf8', errors='ignore') as f:
            for line in f:
                values = line.split()
                word = values[0]
                coefs = asarray(values[1:], dtype='float32')
                embeddings_index[word] = coefs

        embedding_matrix = zeros((vocab_size, emb_dim))
        for word, i in fitted_tokenizer.word_index.items():   # use the tokenizer passed in, not a global
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector

        return embedding_matrix
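
For completeness, this is how I plan to plug the returned matrix into the Embedding layer (a sketch assuming tf.keras; the GloVe path is just an example):

    from tensorflow.keras.layers import Embedding

    embedding_matrix = load_embeddings(vocab_size, tokenizer, 'glove.840B.300d.txt')  # example path
    embedding_layer = Embedding(input_dim=vocab_size,
                                output_dim=300,
                                weights=[embedding_matrix],    # pre-trained vectors as initial weights
                                input_length=max_length,
                                trainable=False)               # freeze them, or set True to fine-tune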

My question is: can we load the .vec file directly, and can we load the .bin file as I have described above, with the given load_embeddings() function?

I have found the answer to this. Please comment if there is any problem.

import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences


class PreProcess():
    # check: https://stackabuse.com/pythons-classmethod-and-staticmethod-explained/ for @staticmethod use
    @staticmethod # you don't have to create an object of this class in order to access this method: PreProcess.preprocess_data()
    def preprocess_data(data:list,max_length:int):
        '''
        Method to parse, tokenize, build vocab and padding the text data
        args:
            data: List of all the texts as: ['this is text 1','this is text 2 of different length']
            max_length: maximum length to consider for an individual text entry in data
        out:
            vocab size, fitted tokenizer object, encoded input text and padded input text
        '''
        tokenizer = Tokenizer() # set num_words, oov_token arguments depending on your usecase
        tokenizer.fit_on_texts(data)
        vocab_size = len(tokenizer.word_index) + 1 # extra 1 for unknown words which will be all 0s when loading pre trained embeddings
        encoded_docs = tokenizer.texts_to_sequences(data)
        padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')  
        return vocab_size,tokenizer,encoded_docs,padded_docs
    
    
    @staticmethod
    def load_pretrained_embeddings(fitted_tokenizer, vocab_size:int, emb_file:str, emb_dim:int=300):
        '''
        All 300D Embeddings: https://www.kaggle.com/reppy4620/embeddings
        '''
        if '.bin' in emb_file: # a binary file is not the embeddings but the MODEL itself; it could be a fasttext or word2vec model
            model = KeyedVectors.load_word2vec_format(emb_file, binary=True)
            # emb_file = emb_file.replace('.bin','.txt') # general-purpose path
            emb_file = './new_emb_file.txt' # for Kaggle, because you can only write to the output dir
            model.save_word2vec_format(emb_file, binary=False)

        # open and read the contents of the .txt / .vec file (.vec has the same layout as .txt, plus a header line)
        embeddings_index = dict()
        with open(emb_file, encoding="utf8", errors='ignore') as f:
            for i, line in enumerate(f): # each line looks like: hello 0.9 0.3 0.5 0.01 0.001 ...
                if i > 0: # why this? Most Kaggle kernels use "if len(line)>100" instead; both skip the header line that distinguishes Word2Vec-style from GloVe-style embeddings
                    # check this link: https://radimrehurek.com/gensim/scripts/glove2word2vec.html
                    values = line.split(' ')
                    word = values[0] # the first value is "hello"
                    coefs = np.asarray(values[1:], dtype='float32') # everything else is the vector of "hello"
                    embeddings_index[word] = coefs

        # create the embedding matrix (the Embedding layer weights) based on your data
        embedding_matrix = np.zeros((vocab_size, emb_dim)) # build embeddings based on our vocab size
        for word, i in fitted_tokenizer.word_index.items(): # go through each vocab token one by one
            embedding_vector = embeddings_index.get(word) # get it from the loaded embeddings
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector # if present, fill in the corresponding vector

        return embedding_matrix

             
    @staticmethod
    def load_ELMO(data):
        pass
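
A quick usage sketch for the class above (paths, max_length and the 300 dimension are placeholders):

    from tensorflow.keras.layers import Embedding

    data = ['this is text 1', 'this is text 2 of different length']
    vocab_size, tok, encoded, padded = PreProcess.preprocess_data(data, max_length=30)
    emb_matrix = PreProcess.load_pretrained_embeddings(tok, vocab_size, 'GoogleNews-vectors-negative300.bin')
    emb_layer = Embedding(vocab_size, 300, weights=[emb_matrix], input_length=30, trainable=False)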
    
    
