
How to do supervised learning with Gensim/Word2Vec/Doc2Vec on a large corpus of text documents?

I have a set of text documents (2000+) with labels (Liked/Disliked). Each document consists of 200+ words. I am trying to do supervised learning with these documents. My approach would be:

  1. Vectorize each document in the corpus. Say we have 2347 docs.
  2. Label each of the 2347 rows, viz. Like as 1 and Dislike as 0.
  3. Train any supervised ML classification model on the resulting dataset of 2347 rows.

How do I vectorize the documents and create such a dataset?

One of the things you can try is using Doc2Vec. This will allow you to map each document to a vector of dimension N. Then you can use any supervised learning algorithm to train on these N features.
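
A minimal sketch of that pipeline with gensim's Doc2Vec and scikit-learn. The `texts` and `labels` variables, the vector size of 100, and the choice of LogisticRegression are illustrative assumptions, not part of your setup:

    # Doc2Vec + classifier sketch; `texts` (list of strings) and `labels`
    # (0 = Dislike, 1 = Like) are assumed to hold your 2347 documents.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    tagged = [TaggedDocument(words=simple_preprocess(doc), tags=[i])
              for i, doc in enumerate(texts)]

    model = Doc2Vec(vector_size=100, min_count=2, epochs=40)  # N = 100 here
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

    X = [model.dv[i] for i in range(len(texts))]  # one 100-d vector per document
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))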

There are other alternatives to doc2vec mentioned here. Try the average of Word2Vec vectors weighted by TF-IDF as well.
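
One way to sketch the TF-IDF weighted Word2Vec averaging, again assuming `texts` holds the raw documents and treating the tokenization and vector size as placeholders:

    # Weighted average of Word2Vec word vectors, one vector per document.
    import numpy as np
    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess
    from sklearn.feature_extraction.text import TfidfVectorizer

    tokens = [simple_preprocess(doc) for doc in texts]
    w2v = Word2Vec(sentences=tokens, vector_size=100, min_count=2, epochs=20)

    tfidf = TfidfVectorizer()
    tfidf.fit(" ".join(t) for t in tokens)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

    def doc_vector(words):
        # Weight each in-vocabulary word vector by its IDF and average;
        # words missing from the TF-IDF vocabulary fall back to weight 1.0.
        pairs = [(w2v.wv[w], idf.get(w, 1.0)) for w in words if w in w2v.wv]
        if not pairs:
            return np.zeros(w2v.vector_size)
        vecs, weights = zip(*pairs)
        return np.average(vecs, axis=0, weights=weights)

    X = np.vstack([doc_vector(t) for t in tokens])  # shape (n_docs, 100)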

Also, make sure you apply appropriate text cleaning before applying doc2vec or word2vec: steps like case normalization, stopword removal, and punctuation removal. Which steps are appropriate really depends on your dataset. Find out more here.
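
A small cleaning sketch using gensim's built-in helpers; whether to drop stopwords at all, and which list to use, is your call:

    # Basic cleaning: lowercase, strip punctuation, drop stopwords.
    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS

    def clean(doc):
        # simple_preprocess lowercases, removes punctuation and very short tokens
        return [w for w in simple_preprocess(doc) if w not in STOPWORDS]

    tokens = [clean(doc) for doc in texts]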

I would also suggest engineering some features from your data if you are looking to predict like/dislike. This depends on your data and problem, but some examples are (a rough sketch follows the list):

  • The proportion of uppercase words
  • Slang words present or not
  • Emoticons present or not
  • Language of the text
  • The sentiment of the text - this is a whole new topic altogether though
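
A rough sketch of a few of these handcrafted features; the slang set and emoticon regex below are illustrative placeholders only, and language detection and sentiment would need dedicated libraries:

    # Simple handcrafted features per document.
    import re

    SLANG = {"lol", "omg", "brb", "idk"}      # illustrative, extend as needed
    EMOTICON = re.compile(r"[:;=]-?[)(DP]")   # matches :) ;-( =D etc.

    def extra_features(doc):
        words = doc.split()
        return {
            "upper_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
            "has_slang": int(any(w.lower() in SLANG for w in words)),
            "has_emoticon": int(bool(EMOTICON.search(doc))),
        }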

I hope this was helpful...
