
How to do supervised learning with Gensim/Word2Vec/Doc2Vec on a large corpus of text documents?

I have a set of text documents (2000+) with labels (Liked/Disliked). Each document consists of 200+ words. I am trying to do supervised learning with these documents. My approach would be:

  1. Vectorize each document in the corpus. Say we have 2347 docs.
  2. Label each of the 2347 rows, viz. Like as 1 and Dislike as 0.
  3. Train any supervised ML classification model on the resulting dataset of 2347 rows.

How do I vectorize the documents and create such a dataset?

One of the things you can try is using Doc2Vec. This will allow you to map each document to a vector of dimension N. Then you can use any supervised learning algorithm to train on these N features.
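
A minimal sketch of that pipeline with gensim's Doc2Vec and scikit-learn. The `texts` and `labels` variables, the vector size of 100, and the choice of LogisticRegression are illustrative assumptions, not part of your setup:

    # Doc2Vec + classifier sketch; `texts` (list of strings) and `labels`
    # (0 = Dislike, 1 = Like) are assumed to hold your 2347 documents.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from gensim.utils import simple_preprocess
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    tagged = [TaggedDocument(words=simple_preprocess(doc), tags=[i])
              for i, doc in enumerate(texts)]

    model = Doc2Vec(vector_size=100, min_count=2, epochs=40)  # N = 100 here
    model.build_vocab(tagged)
    model.train(tagged, total_examples=model.corpus_count, epochs=model.epochs)

    X = [model.dv[i] for i in range(len(texts))]  # one 100-d vector per document
    X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.2)

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    print("accuracy:", clf.score(X_test, y_test))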

There are other alternatives to doc2vec mentioned here. Try the average of Word2Vec vectors weighted by TF-IDF as well.
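
One way to sketch the TF-IDF weighted Word2Vec averaging, again assuming `texts` holds the raw documents and treating the tokenization and vector size as placeholders:

    # Weighted average of Word2Vec word vectors, one vector per document.
    import numpy as np
    from gensim.models import Word2Vec
    from gensim.utils import simple_preprocess
    from sklearn.feature_extraction.text import TfidfVectorizer

    tokens = [simple_preprocess(doc) for doc in texts]
    w2v = Word2Vec(sentences=tokens, vector_size=100, min_count=2, epochs=20)

    tfidf = TfidfVectorizer()
    tfidf.fit(" ".join(t) for t in tokens)
    idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

    def doc_vector(words):
        # Weight each in-vocabulary word vector by its IDF and average;
        # words missing from the TF-IDF vocabulary fall back to weight 1.0.
        pairs = [(w2v.wv[w], idf.get(w, 1.0)) for w in words if w in w2v.wv]
        if not pairs:
            return np.zeros(w2v.vector_size)
        vecs, weights = zip(*pairs)
        return np.average(vecs, axis=0, weights=weights)

    X = np.vstack([doc_vector(t) for t in tokens])  # shape (n_docs, 100)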

Also, make sure you apply appropriate text cleaning before applying doc2vec or word2vec: steps like case normalization, stopword removal, and punctuation removal. Which steps are appropriate really depends on your dataset. Find out more here.
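
A small cleaning sketch using gensim's built-in helpers; whether to drop stopwords at all, and which list to use, is your call:

    # Basic cleaning: lowercase, strip punctuation, drop stopwords.
    from gensim.utils import simple_preprocess
    from gensim.parsing.preprocessing import STOPWORDS

    def clean(doc):
        # simple_preprocess lowercases, removes punctuation and very short tokens
        return [w for w in simple_preprocess(doc) if w not in STOPWORDS]

    tokens = [clean(doc) for doc in texts]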

I would also suggest engineering some features from your data if you are looking to predict like/dislike. This depends on your data and problem, but some examples are (a rough sketch follows the list):

  • The proportion of uppercase words
  • Slang words present or not
  • Emoticons present or not
  • Language of the text
  • The sentiment of the text - this is a whole new topic altogether though
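
A rough sketch of a few of these handcrafted features; the slang set and emoticon regex below are illustrative placeholders only, and language detection and sentiment would need dedicated libraries:

    # Simple handcrafted features per document.
    import re

    SLANG = {"lol", "omg", "brb", "idk"}      # illustrative, extend as needed
    EMOTICON = re.compile(r"[:;=]-?[)(DP]")   # matches :) ;-( =D etc.

    def extra_features(doc):
        words = doc.split()
        return {
            "upper_ratio": sum(w.isupper() for w in words) / max(len(words), 1),
            "has_slang": int(any(w.lower() in SLANG for w in words)),
            "has_emoticon": int(bool(EMOTICON.search(doc))),
        }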

I hope this was helpful...
