
Gensim Doc2Vec Training

Original post 2018-02-23 13:40:21 7 1 python / gensim / doc2vec

I am using gensim to train a Doc2Vec model on documents assigned to particular people. There are 10 million documents and 8,000 people. I don't care about all 8,000 people. I care about a specific group of people (say anywhere from 1 to 500).

The people I'm interested in could change day-to-day, but I will never need to look at the full population. The end goal is to have the resulting vectors for the people I am interested in. Currently, I retrain the model each time on only the documents assigned to those specific people.

Should I train the model on all 10 million documents? Or should I train the model on only the documents assigned to the people I'm interested in? If it's important to train it on all 10 million documents, how would I then get the vectors only for the people I'm interested in?

1 answer

It is a good idea to train on all 10 million documents. That will help the model capture the general meaning of the words, not just their meaning within the context of the authors you are interested in. It will also help if the set of authors you are interested in changes tomorrow.

If Doc2Vec training takes too long, you could instead use fastText to learn word embeddings and build each document vector as a simple average or a TF-IDF weighted average of its word vectors. fastText also offers hierarchical softmax as a loss function, which can reduce training time substantially.
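The averaging step itself does not depend on any particular library. Below is a rough sketch in plain Python, assuming the word vectors have already been learned elsewhere (for example with fastText) and loaded into a dictionary. The tiny `word_vectors` table, the sample documents, and the smoothed IDF formula are all illustrative assumptions, not something prescribed by the answer.

```python
import math
from collections import Counter

# Hypothetical pre-trained word vectors (2-dimensional for readability).
word_vectors = {
    "fox": [1.0, 0.0],
    "dog": [0.0, 1.0],
    "quick": [1.0, 1.0],
}

# Tokenized documents; "unknownword" has no vector and is ignored.
docs = [["quick", "fox"], ["quick", "dog"], ["fox", "unknownword"]]


def idf_weights(docs):
    """Smoothed inverse document frequency for every word in the corpus."""
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def mean_vector(tokens, word_vectors):
    """Simple average of the vectors of all known tokens, or None if none are known."""
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def tfidf_vector(tokens, word_vectors, idf):
    """TF-IDF weighted average of the vectors of all known tokens."""
    tf = Counter(t for t in tokens if t in word_vectors)
    total = sum(count * idf.get(t, 1.0) for t, count in tf.items())
    if total == 0:
        return None
    dim = len(next(iter(word_vectors.values())))
    out = [0.0] * dim
    for t, count in tf.items():
        weight = count * idf.get(t, 1.0) / total  # weights sum to 1
        for i, x in enumerate(word_vectors[t]):
            out[i] += weight * x
    return out


idf = idf_weights(docs)
plain = mean_vector(docs[1], word_vectors)
weighted = tfidf_vector(docs[1], word_vectors, idf)
```

In the weighted version, rarer words such as "dog" pull the document vector more strongly than words that appear in many documents. A per-person vector could then be formed by averaging that person's document vectors.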

Gensim doc2vec training on ngrams

gensim - Doc2Vec: MemoryError when training on english Wikipedia

Why use TaggedBrownCorpus when training gensim doc2vec

Gensim Doc2vec – KeyError: “tag not seen in training corpus/invalid”

gensim Doc2Vec vs tensorflow Doc2Vec

Doc2Vec online training

Doc2Vec Unsupervised training

Gensim doc2vec sentence tagging

Issues in doc2vec tags in Gensim

Gensim DOC2VEC trims and delete the vocabulary


The technical posts on this site are licensed under CC BY-SA 4.0. If you reprint them, please cite the site URL or the original address. For any questions, please contact: yoyou2525@163.com.

solution1 3 ACCEPTED 2018-02-23 13:54:25
