
Gensim's Doc2Vec - How to use pre-trained word2vec (word similarities)

I don't have a large corpus of data to train word similarities, e.g. that 'hot' is more similar to 'warm' than to 'cold'. However, I'd like to train doc2vec on a relatively small corpus (~100 docs) so that it can classify my domain-specific documents.

To elaborate, let me use this toy example. Assume I have only 4 training docs, given by 4 sentences: "I love hot chocolate.", "I hate hot chocolate.", "I love hot tea.", and "I love hot cake.". Given a test document "I adore hot chocolate", I would expect doc2vec to invariably return "I love hot chocolate." as the closest document. This expectation would hold if word2vec already supplied the knowledge that "adore" is very similar to "love". However, the most similar document I get is "I hate hot chocolate" -- which is bizarre!

Any suggestion on how to circumvent this, i.e. how to use pre-trained word embeddings so that I don't need to train from scratch that "adore" is close to "love", "hate" is close to "detest", and so on?

Code (Jupyter Notebook, Python 3.7, Gensim 3.8.1):

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
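# import nltk; nltk.download('punkt')  # uncomment on first run: word_tokenize needs NLTK's punkt tokenizer data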
data = ["I love hot chocolate.",
        "I hate hot chocolate",
       "I love hot tea.",
       "I love hot cake."]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
print(tagged_data)
#Train and save
max_epochs = 10
vec_size = 5
alpha = 0.025


model = Doc2Vec(vector_size=vec_size, #it was size earlier
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    if epoch % 10 == 0:
        print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs) #It was model.iter earlier
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

print("Model Ready")

test_sentence="I adore hot chocolate"
test_data = word_tokenize(test_sentence.lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)

# to find most similar doc using tags
sims = model.docvecs.most_similar([v1])
print("\nTest: %s\n" %(test_sentence))
for indx, score in sims:
    print("\t(score: %.4f) %s" %(score, data[int(indx)]))

Just ~100 documents is way too small to meaningfully train a Doc2Vec (or Word2Vec) model. Published Doc2Vec work tends to use tens of thousands to millions of documents.

To the extent you may be able to get slightly meaningful results from smaller datasets, you'll usually need to reduce the vector size a lot (to far smaller than the number of words/examples) and increase the training epochs. (Your toy data has 4 texts and only a handful of unique words. Even to get 5-dimensional vectors, you probably want something like 5^2 contrasting documents.)
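As an illustration only (not a recipe that will make 4 documents work well), a model sized down for a tiny corpus might be configured like this; the specific numbers are assumptions chosen for the toy data, not recommended values:

small_model = Doc2Vec(vector_size=5,   # far below the default of 100
                      min_count=1,     # keep every word; only sensible on toy-sized data
                      epochs=200,      # many more passes to compensate for so little text
                      dm=1)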

Also, gensim's Doc2Vec doesn't offer any official option to import word-vectors from elsewhere. The internal Doc2Vec training is not a process where word-vectors are trained first and doc-vectors calculated afterwards. Rather, doc-vectors and word-vectors are trained in a simultaneous process, gradually improving together. (Some modes, like the fast and often highly effective DBOW that can be enabled with dm=0, don't create or use word-vectors at all.)
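For reference, a minimal sketch of enabling that DBOW mode; dm=0 is the actual gensim flag, and the surrounding numbers are again just illustrative assumptions:

dbow_model = Doc2Vec(vector_size=5,
                     min_count=1,
                     epochs=100,
                     dm=0)   # PV-DBOW: trains doc-vectors only, no word-vectors
dbow_model.build_vocab(tagged_data)
dbow_model.train(tagged_data,
                 total_examples=dbow_model.corpus_count,
                 epochs=dbow_model.epochs)

Setting dbow_words=1 alongside dm=0 would additionally train word-vectors in an interleaved skip-gram fashion, at the cost of slower training.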

There's not really anything bizarre about your 4-sentence results when you look at the data as if you were the Doc2Vec or Word2Vec algorithm, which has no prior knowledge about words, only what's in the training data. In your training data, the token 'love' and the token 'hate' are used in almost exactly the same way, with the same surrounding words. Only by seeing many subtly varied alternative uses of words, alongside many contrasting surrounding words, can these "dense embedding" models move the word-vectors to useful relative positions, where they are closer to related words and farther from other words. (And, since you've provided no training data containing the token 'adore', the model knows nothing about that word; if it appears in a test document passed to the model's infer_vector() method, it will be ignored. So the test document the model 'sees' is only the known words ['i', 'hot', 'chocolate'].)
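If you want to see this directly, a small check (assuming the gensim 3.x API, where the known vocabulary is exposed as model.wv.vocab; in gensim 4+ it is model.wv.key_to_index) shows which test tokens the model will actually use:

test_tokens = word_tokenize("I adore hot chocolate".lower())
known = [w for w in test_tokens if w in model.wv.vocab]
ignored = [w for w in test_tokens if w not in model.wv.vocab]
print("known:", known)      # expected here: ['i', 'hot', 'chocolate']
print("ignored:", ignored)  # expected here: ['adore'], silently dropped by infer_vector()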

But also, even if you did manage to train on a larger dataset, or somehow inject from other word-vectors the knowledge that 'love' and 'adore' are somewhat similar, it's important to note that antonyms are typically quite similar in sets of word-vectors, too, since they are used in the same contexts, are often syntactically interchangeable, and belong to the same general category. These models often aren't very good at detecting the flip in human-perceived meaning that comes from swapping a word for its antonym (or from inserting a single 'not' or other meaning-reversing word).

Ultimately, if you want to use gensim's Doc2Vec, you should train it with far more data. (If you were willing to grab some other pre-trained word-vectors, why not grab some other source of somewhat-similar bulk sentences? The effect of using data that isn't exactly like your actual problem will be similar whether you leverage that outside data via bulk text or via a pre-trained model.)

Finally: it's a bad, error-prone pattern to call train() more than once in your own loop, with your own alpha adjustments. You can just call it once, with the right number of epochs, and the model will perform the multiple training passes and manage the internal alpha smoothly over that number of epochs.
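As a hedged sketch of that simpler pattern (the epochs value is just an illustrative assumption for a tiny corpus), the whole loop above could be replaced by one build_vocab() and one train() call, leaving alpha and min_alpha at their defaults so gensim decays the learning rate internally:

model = Doc2Vec(vector_size=vec_size,
                min_count=1,
                epochs=100,   # assumption: more passes than the default for very small data
                dm=1)
model.build_vocab(tagged_data)
model.train(tagged_data,
            total_examples=model.corpus_count,
            epochs=model.epochs)   # one call; gensim handles the alpha decay across all epochs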
