
Gensim's Doc2Vec - How to use pre-trained word2vec (word similarities)

I don't have a large corpus of data to train word similarities, e.g. that 'hot' is more similar to 'warm' than to 'cold'. However, I would like to train doc2vec on a relatively small corpus (~100 docs) so that it can classify my domain-specific documents.

To elaborate, let me use this toy example. Assume I have only 4 training docs, given by 4 sentences: "I love hot chocolate.", "I hate hot chocolate.", "I love hot tea.", and "I love hot cake.". Given a test document "I adore hot chocolate", I would expect doc2vec to invariably return "I love hot chocolate." as the closest document. This expectation would hold if word2vec already supplied the knowledge that "adore" is very similar to "love". However, I'm getting "I hate hot chocolate" as the most similar document, which is bizarre!

Any suggestions on how to circumvent this, i.e. how to use pre-trained word embeddings so that I don't need to train the model myself to learn that "adore" is close to "love", "hate" is close to "detest", and so on?

Code (Jupyter Notebook, Python 3.7, Gensim 3.8.1)

from gensim.models.doc2vec import Doc2Vec, TaggedDocument
from nltk.tokenize import word_tokenize
data = ["I love hot chocolate.",
        "I hate hot chocolate",
       "I love hot tea.",
       "I love hot cake."]

tagged_data = [TaggedDocument(words=word_tokenize(_d.lower()), tags=[str(i)]) for i, _d in enumerate(data)]
print(tagged_data)
#Train and save
max_epochs = 10
vec_size = 5
alpha = 0.025


model = Doc2Vec(vector_size=vec_size, #it was size earlier
                alpha=alpha, 
                min_alpha=0.00025,
                min_count=1,
                dm =1)

model.build_vocab(tagged_data)

for epoch in range(max_epochs):
    if epoch % 10 == 0:
        print('iteration {0}'.format(epoch))
    model.train(tagged_data,
                total_examples=model.corpus_count,
                epochs=model.epochs) #It was model.iter earlier
    # decrease the learning rate
    model.alpha -= 0.0002
    # fix the learning rate, no decay
    model.min_alpha = model.alpha

print("Model Ready")

test_sentence="I adore hot chocolate"
test_data = word_tokenize(test_sentence.lower())
v1 = model.infer_vector(test_data)
#print("V1_infer", v1)

# to find most similar doc using tags
sims = model.docvecs.most_similar([v1])
print("\nTest: %s\n" %(test_sentence))
for indx, score in sims:
    print("\t(score: %.4f) %s" %(score, data[int(indx)]))

Just ~100 documents is far too few to meaningfully train a Doc2Vec (or Word2Vec) model. Published Doc2Vec work tends to use tens of thousands to millions of documents.

To the extent you may be able to get slightly meaningful results from smaller datasets, you'll usually need to reduce the vector size a lot, to far smaller than the number of words/examples, and increase the number of training epochs. (Your toy data has 4 texts & 7 unique words. Even to get 5-dimensional vectors, you probably want something like 5^2 contrasting documents.)

Also, gensim's Doc2Vec doesn't offer any official option to import word-vectors from elsewhere. The internal Doc2Vec training is not a process where word-vectors are trained first and doc-vectors are calculated afterwards. Rather, doc-vectors & word-vectors are trained in a simultaneous process, gradually improving together. (Some modes, like the fast & often highly effective DBOW that can be enabled with dm=0, don't create or use word-vectors at all.)

There's not really anything bizarre about your 4-sentence results when you look at the data the way the Doc2Vec or Word2Vec algorithms do: they have no prior knowledge about words, only what's in the training data. In your training data, the token 'love' and the token 'hate' are used in nearly exactly the same way, with the same surrounding words. Only by seeing many subtly varied alternative uses of words, alongside many contrasting surrounding words, can these "dense embedding" models move the word-vectors to useful relative positions, where they are closer to related words & farther from other words. (And, since you've provided no training data with the token 'adore', the model knows nothing about that word. If it appears inside a test document passed to the model's infer_vector() method, it will be ignored. So the test document the model 'sees' is only the known words ['i', 'hot', 'chocolate'].)
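To make that concrete, here is a minimal sketch in plain Python (no gensim) that collects the neighbouring words of each token in the toy data and filters the test document down to known words. The tokenize() and context_sets() helpers and the window size of 2 are my own assumptions for illustration; this is not what Doc2Vec computes internally, only a rough picture of the evidence it has to work with.

```python
import re
from collections import defaultdict

data = [
    "I love hot chocolate.",
    "I hate hot chocolate.",
    "I love hot tea.",
    "I love hot cake.",
]

def tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: lowercase alphabetic runs only.
    return re.findall(r"[a-z]+", text.lower())

def context_sets(docs, window=2):
    # Map each token to the set of tokens seen within `window` positions of it.
    contexts = defaultdict(set)
    for doc in docs:
        tokens = tokenize(doc)
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            contexts[word].update(tokens[lo:i] + tokens[i + 1:hi])
    return contexts

contexts = context_sets(data)
vocabulary = set(contexts)

print("love:", sorted(contexts["love"]))
print("hate:", sorted(contexts["hate"]))

# Every context in which 'hate' occurs is also a context of 'love'.
print("hate contexts are a subset of love contexts:",
      contexts["hate"] <= contexts["love"])

# 'adore' never occurs in training, so it drops out of the test document.
test_tokens = tokenize("I adore hot chocolate")
known_tokens = [t for t in test_tokens if t in vocabulary]
print("known test tokens:", known_tokens)
```

With nothing but this data, 'hate' is indistinguishable from 'love' by context, and the test document reduces to words that both "chocolate" sentences share equally.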

But also, even if you did manage to train on a larger dataset, or somehow inject the knowledge from other word-vectors that 'love' and 'adore' are somewhat similar, it's important to note that antonyms are typically quite similar in sets of word-vectors, too: they are used in the same contexts, are often syntactically interchangeable, and belong to the same general category. These models often aren't very good at detecting the reversal of meaning that a human perceives when a word is swapped for its antonym (or when a single 'not' or other intent-reversing word is inserted).

Ultimately, if you want to use gensim's Doc2Vec, you should train it with far more data. (If you were willing to grab some other pre-trained word-vectors, why not grab some other source of somewhat-similar bulk sentences? The effect of using data that isn't exactly like your actual problem will be similar whether you leverage that outside data via bulk text or a pre-trained model.)

Finally: it's a bad, error-prone pattern to call train() more than once in your own loop, with your own alpha adjustments. You can just call it once, with the right number of epochs, and the model will perform the multiple training passes & manage the internal alpha smoothly over that number of epochs.


 