
Doc2Vec Unsupervised training

I need a suggestion regarding unsupervised training of Doc2Vec for the two options I have. The scenario is that I have N documents, each longer than 3000 tokens. Which alternative is better for training:

  1. Training on each whole document as-is.
  2. Breaking the documents into 1000-token chunks and then training on those chunks (see the sketch below).
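A minimal sketch of how the two setups might be built with gensim; the `docs` variable, the `CHUNK` constant, and the `doc_*` tag scheme are illustrative placeholders, not from the original question:

```python
from gensim.models.doc2vec import TaggedDocument

# Toy stand-in for the real corpus: a list of token lists, one per document.
docs = [
    "first long document text goes here as a tiny placeholder".split(),
    "second document with its own placeholder tokens for the sketch".split(),
]

CHUNK = 1000  # illustrative chunk size from option 2

# Option 1: one TaggedDocument per whole document.
whole_corpus = [
    TaggedDocument(words=tokens, tags=[f"doc_{i}"])
    for i, tokens in enumerate(docs)
]

# Option 2: one TaggedDocument per 1000-token chunk, tagged by doc and chunk index.
chunk_corpus = [
    TaggedDocument(words=tokens[start:start + CHUNK],
                   tags=[f"doc_{i}_chunk_{start // CHUNK}"])
    for i, tokens in enumerate(docs)
    for start in range(0, len(tokens), CHUNK)
]
```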

You should watch out for docs with more than 10000 tokens: that's an internal implementation limit of gensim, and tokens beyond the 10000th position in a single document will be ignored.
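As a quick guard against that silent truncation, one could scan the corpus before training (reusing the illustrative `docs` list from the sketch above; the limit constant just mirrors the 10000-token figure mentioned here):

```python
# gensim ignores tokens past position 10000 within a single TaggedDocument,
# so flag any document that would be truncated during training.
LIMIT = 10000
too_long = [i for i, tokens in enumerate(docs) if len(tokens) > LIMIT]
if too_long:
    print(f"{len(too_long)} document(s) exceed {LIMIT} tokens and would be truncated: {too_long}")
```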

But whether you should split documents into 1000-token chunks depends entirely on what works best for your specific data and goals. If you have a reason to consider it (perhaps you want to be able to retrieve results for sub-document ranges?), then you should try it, compare the results against the alternative, and use whichever works better. There is no general answer.
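A rough sketch of such a comparison, assuming gensim 4.x and the `whole_corpus` / `chunk_corpus` lists from the earlier sketch; the hyperparameters are arbitrary placeholders, and the "does each text retrieve its own tag" check is only a sanity check, not a substitute for evaluating on your real downstream task:

```python
from gensim.models.doc2vec import Doc2Vec

def train(corpus):
    # Arbitrary, illustrative hyperparameters; tune for your own data.
    model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=20, workers=4)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model

whole_model = train(whole_corpus)
chunk_model = train(chunk_corpus)

def self_hit_rate(model, corpus):
    # Crude sanity check: re-infer a vector for each training text and see
    # whether its own tag comes back as the nearest neighbour.
    hits = 0
    for td in corpus:
        inferred = model.infer_vector(td.words)
        nearest_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
        hits += nearest_tag in td.tags
    return hits / len(corpus)

print("whole-document self-hit rate:", self_hit_rate(whole_model, whole_corpus))
print("chunked self-hit rate:", self_hit_rate(chunk_model, chunk_corpus))
```

For the actual decision, replace the sanity check with whatever downstream measure you really care about (retrieval quality, classification accuracy, etc.) and keep whichever variant scores better.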
