
How to use doc2vec model in production?

I wonder how to deploy a doc2vec model in production to create document vectors as input features to a classifier. To be specific, let's say a doc2vec model is trained on a corpus as follows.

dataset['tagged_descriptions'] = dataset.apply(lambda x: doc2vec.TaggedDocument(
            words=x['text_columns'], tags=[str(x.ID)]), axis=1)

model = doc2vec.Doc2Vec(vector_size=100, min_count=1, epochs=150, workers=cores,
                                window=5, hs=0, negative=5, sample=1e-5, dm_concat=1)

corpus = dataset['tagged_descriptions'].tolist()

model.build_vocab(corpus)

model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

and then it is dumped into a pickle file. The document vectors are then used to train a classifier, such as a random forest, to predict movie sentiment.

Now suppose that in production, a document arrives containing some totally new vocabulary; that is, words that were not present during the training of the doc2vec model. I wonder how to handle such a case.

As a side note, I am aware of Updating training documents for gensim Doc2Vec model and Gensim: how to retrain doc2vec model using previous word2vec model. However, I would appreciate more light being shed on this matter.

A Doc2Vec model will only be able to report trained-up vectors for documents that were present during training, and only be able to infer_vector() new doc-vectors for texts containing words that were present during training. (Unrecognized words passed to .infer_vector() will be ignored, similar to the way any words appearing fewer than min_count times are ignored during training.)

If over time you acquire many new texts with new vocabulary words, and those words are important, you'll have to occasionally re-train the Doc2Vec model. And, after re-training, the doc-vectors from the re-trained model will generally not be comparable to doc-vectors from the original model – so downstream classifiers and other applications using the doc-vectors will need updating, as well.

Your own production/deployment requirements will drive how often this re-training should happen, and old models replaced with newer ones.

(While a Doc2Vec model can be fed new training data at any time, doing so incrementally as a sort of 'fine-tuning' introduces hairy issues of balance between old and new data. And, there's no official gensim support for expanding the existing vocabulary of a Doc2Vec model. So, the most robust course is to retrain from scratch using all available data.)

A few side notes on your example training code:

  • it's rare for min_count=1 to be a good idea: words with only one or a few usage examples can't be modeled well, and so act as 'noise' that slows training and interferes with the patterns learnable from more common words

  • dm_concat=1 is best considered an experimental/advanced mode, as it makes models significantly larger and slower to train, with unproven benefits.

  • much published work uses just 10-20 training epochs; smaller datasets or shorter docs sometimes benefit from more, but 150 may cost a lot of training time for very little marginal benefit.
