简体   繁体   中英

Gensim Doc2Vec Training

I am using gensim to train a Doc2Vec model on documents assigned to particular people. There are 10 million documents and 8,000 people. I don't care about all 8,000 people. I care about a specific group of people (say anywhere from 1 to 500).

The people I'm interested in could change day-to-day, but I will never need to look at the full population. The end goal is to have the resulting vectors of the people I am interested in. I am currently training the model each time on the documents assigned to the specific people.

Should I train the model on all 10 million documents? Or should I train the model on only the documents assigned to the people I'm interested in? If it's important to train it on all 10 million documents, how would I then get the vectors only for the people I'm interested in?

It is a good idea to train on all the 10 million documents, that will help you capture the general essence of the words and not just with in the context of authors that you are interested in. Also, it will help you if the set of authors who you are interested in, changes tomorrow.

If you think Doc2Vec takes a lot of time, you could also use Fasttext to learn WordEmbeddings and use a simple average or TF-IDF weighted average on the word vectors to construct your DocumentVector. You could leverage the power of hierarchical softmax (loss function) in Fasttext that will reduce your training time by 1000+ folds.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM