I have a set of 2000+ text documents with labels (Liked/Disliked). Each document consists of 200+ words. I am trying to do supervised learning with these documents. My approach would be:
How do I vectorize the documents and create such a dataset?
One thing you can try is Doc2Vec. It maps each document to a vector of dimension N, and you can then train any supervised learning algorithm on these N features.
There are other alternatives to Doc2Vec mentioned here. Try averaging Word2Vec vectors with TF-IDF weighting as well.
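To make the TF-IDF idea concrete, here is a small sketch in plain Python. The 3-dimensional `word_vectors` are made-up stand-ins for vectors you would get from a trained Word2Vec model, the helper names are my own, and I am reading "with TF-IDF" as "weighted by TF-IDF", which is an assumption.

```python
import math
from collections import Counter

# Hypothetical toy embeddings standing in for trained Word2Vec vectors.
word_vectors = {
    "good": [1.0, 0.0, 0.0],
    "bad": [0.0, 1.0, 0.0],
    "movie": [0.0, 0.0, 1.0],
}

# Already tokenized toy documents (an assumption for illustration).
docs = [["good", "movie"], ["bad", "movie"], ["good", "good", "movie"]]


def inverse_document_frequency(tokenized_docs):
    """Smoothed IDF per word: log((1 + n_docs) / (1 + doc_freq)) + 1."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(word for doc in tokenized_docs for word in set(doc))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}


def tfidf_weighted_average(doc, idf, vectors):
    """Average the word vectors of a document, weighting each word by TF * IDF."""
    counts = Counter(w for w in doc if w in vectors and w in idf)
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for word, count in counts.items():
        weight = (count / len(doc)) * idf[word]  # term frequency times IDF
        weight_sum += weight
        for i, value in enumerate(vectors[word]):
            total[i] += weight * value
    if weight_sum == 0.0:
        return total  # no known words: fall back to the zero vector
    return [v / weight_sum for v in total]


idf = inverse_document_frequency(docs)
doc_vectors = [tfidf_weighted_average(d, idf, word_vectors) for d in docs]
```

Words that appear in every document ("movie" here) get a lower weight than rarer, more discriminative words, so they contribute less to the document vector than they would in a plain average.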
Also, make sure you apply appropriate text cleaning before applying Doc2Vec or Word2Vec, such as case normalization, stopword removal, and punctuation removal. Which steps are appropriate really depends on your dataset. Find out more here.
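A minimal sketch of those three cleaning steps using only the standard library. The stopword set is a tiny illustrative list I made up; in practice you would likely use a fuller list from a library such as NLTK or spaCy.

```python
import string

# Small illustrative stopword list (an assumption, not a complete list).
STOPWORDS = {"a", "an", "the", "is", "and", "of", "to", "in", "it", "this", "was"}

# Map every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text):
    """Lowercase, strip punctuation, split on whitespace, drop stopwords."""
    text = text.lower()                    # case normalization
    text = text.translate(PUNCT_TO_SPACE)  # punctuation removal
    tokens = text.split()
    return [t for t in tokens if t not in STOPWORDS]  # stopword removal


tokens = clean_text("This movie was GREAT, and the ending... wow!")
# tokens == ["movie", "great", "ending", "wow"]
```

Note that replacing punctuation with spaces splits contractions ("don't" becomes "don" and "t"), and dropping stopwords such as "not" can hurt a like/dislike classifier, so check each step against your own data.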
I would also suggest engineering some features from your data if you are looking to predict like/dislike. This depends on your data and problem, but some examples are
I hope this was helpful...