
Doc2Vec Unsupervised training

I need a suggestion regarding unsupervised training of Doc2Vec for the two options I have. The scenario is that I have N documents, each longer than 3000 tokens. Which alternative is better for training:

  1. Training on each whole document as-is.
  2. Breaking the documents into 1000-token chunks and then training on those chunks (see the sketch below).
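A minimal sketch of how the two setups might be built with gensim; the `docs` variable, the `CHUNK` constant, and the `doc_*` tag scheme are illustrative placeholders, not from the original question:

```python
from gensim.models.doc2vec import TaggedDocument

# Toy stand-in for the real corpus: a list of token lists, one per document.
docs = [
    "first long document text goes here as a tiny placeholder".split(),
    "second document with its own placeholder tokens for the sketch".split(),
]

CHUNK = 1000  # illustrative chunk size from option 2

# Option 1: one TaggedDocument per whole document.
whole_corpus = [
    TaggedDocument(words=tokens, tags=[f"doc_{i}"])
    for i, tokens in enumerate(docs)
]

# Option 2: one TaggedDocument per 1000-token chunk, tagged by doc and chunk index.
chunk_corpus = [
    TaggedDocument(words=tokens[start:start + CHUNK],
                   tags=[f"doc_{i}_chunk_{start // CHUNK}"])
    for i, tokens in enumerate(docs)
    for start in range(0, len(tokens), CHUNK)
]
```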

You should watch out for docs with more than 10000 tokens: that's an internal implementation limit of gensim, and tokens beyond the 10000th position in a single document will be ignored.
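As a quick guard against that silent truncation, one could scan the corpus before training (reusing the illustrative `docs` list from the sketch above; the limit constant just mirrors the 10000-token figure mentioned here):

```python
# gensim ignores tokens past position 10000 within a single TaggedDocument,
# so flag any document that would be truncated during training.
LIMIT = 10000
too_long = [i for i, tokens in enumerate(docs) if len(tokens) > LIMIT]
if too_long:
    print(f"{len(too_long)} document(s) exceed {LIMIT} tokens and would be truncated: {too_long}")
```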

But whether you should split documents into 1000-token chunks depends entirely on what works best for your specific data and goals. If you have a reason to consider it (perhaps you want to be able to retrieve results for sub-document ranges?), then you should try it, compare the results against the alternative, and use whichever works better. There is no general answer.
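A rough sketch of such a comparison, assuming gensim 4.x and the `whole_corpus` / `chunk_corpus` lists from the earlier sketch; the hyperparameters are arbitrary placeholders, and the "does each text retrieve its own tag" check is only a sanity check, not a substitute for evaluating on your real downstream task:

```python
from gensim.models.doc2vec import Doc2Vec

def train(corpus):
    # Arbitrary, illustrative hyperparameters; tune for your own data.
    model = Doc2Vec(vector_size=100, window=5, min_count=1, epochs=20, workers=4)
    model.build_vocab(corpus)
    model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
    return model

whole_model = train(whole_corpus)
chunk_model = train(chunk_corpus)

def self_hit_rate(model, corpus):
    # Crude sanity check: re-infer a vector for each training text and see
    # whether its own tag comes back as the nearest neighbour.
    hits = 0
    for td in corpus:
        inferred = model.infer_vector(td.words)
        nearest_tag, _ = model.dv.most_similar([inferred], topn=1)[0]
        hits += nearest_tag in td.tags
    return hits / len(corpus)

print("whole-document self-hit rate:", self_hit_rate(whole_model, whole_corpus))
print("chunked self-hit rate:", self_hit_rate(chunk_model, chunk_corpus))
```

For the actual decision, replace the sanity check with whatever downstream measure you really care about (retrieval quality, classification accuracy, etc.) and keep whichever variant scores better.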
