
How to combine two pre-trained Word2Vec models?

I successfully followed the deeplearning4j.org tutorial on Word2Vec, so I am able to load an already-trained model or train a new one from raw text (more specifically, I am using the GoogleNews-vectors-negative300 and Emoji2Vec pre-trained models).

However, I would like to combine these two models for the following reason: given a sentence (for example, a comment from Instagram or Twitter that contains emoji), I want to identify each emoji in the sentence and then map it to the word it is most related to. To do that, I was planning to iterate over all the words in the sentence and calculate the closeness (how near the emoji and the word are located in the vector space).
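That closeness check is just cosine similarity between vectors. A minimal sketch with plain NumPy, where the tokens and vector values are invented for illustration (real vectors would come from the two pre-trained models and be 300-dimensional):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors; closer to 1.0 = more similar.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for lookups into the word and emoji models.
vectors = {
    "dog": np.array([0.9, 0.1, 0.0]),
    "cat": np.array([0.1, 0.9, 0.0]),
    "🐶":  np.array([0.85, 0.15, 0.05]),
}

sentence = ["dog", "cat", "🐶"]
emoji = "🐶"

# Find the word in the sentence closest to the emoji.
best = max((w for w in sentence if w != emoji),
           key=lambda w: cosine_similarity(vectors[w], vectors[emoji]))
print(best)  # "dog"
```

Note this only works if the emoji and the words live in the same vector space, which is exactly the problem the answer below addresses.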

I found code showing how to uptrain an already-existing model. However, it is mentioned that new words are not added in this case; only the weights for existing words will be updated based on the new text corpus.

I would appreciate any help or ideas on this problem. Thanks in advance!

Combining two models trained from different corpora is not a simple, supported operation in the word2vec libraries with which I'm most familiar.

In particular, even if the same word appears in both corpora, and even in similar contexts, the randomization used by this algorithm during initialization and training, plus the extra randomization injected by multithreaded training, means that the word may end up in wildly different places. It's only the relative distances/orientations with respect to other words that should be roughly similar, not the specific coordinates/rotations.

So merging two models requires translating one model's coordinates into the other's. That will typically involve learning a projection from one space to the other, then moving the words unique to the source space into the surviving space. I don't know whether DL4J has a built-in routine for this; recent versions of the Python gensim library include a TranslationMatrix example class that can do this, motivated by the use of word vectors for language-to-language translation.
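The learn-a-projection step described above can be sketched with ordinary least squares: take words shared by both models as anchor pairs, solve for the linear map that best sends source vectors onto target vectors, then apply that map to source-only words (such as the emoji). The tiny 2-D vectors below are invented purely for illustration:

```python
import numpy as np

# Anchor words present in BOTH models (toy 2-D vectors; real models
# would be e.g. 300-dimensional with thousands of shared anchors).
source = {"dog": np.array([1.0, 0.0]), "cat": np.array([0.0, 1.0])}
target = {"dog": np.array([0.0, 2.0]), "cat": np.array([-2.0, 0.0])}

anchors = ["dog", "cat"]
X = np.vstack([source[w] for w in anchors])
Y = np.vstack([target[w] for w in anchors])

# Least-squares projection W such that X @ W ≈ Y.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Map a vector that exists only in the source model (e.g. an emoji)
# into the surviving target space.
emoji_vec = np.array([0.5, 0.5])
projected = emoji_vec @ W
print(projected)  # ≈ [-1.0, 1.0]
```

In practice the projection is often constrained to be orthogonal (a Procrustes solution) so that distances within the moved vectors are preserved; the unconstrained least-squares version here is just the simplest form of the idea.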
