I am trying to classify paragraphs based on their sentiment. I have training data of 600,000 documents. When I convert them to a TF-IDF vector space, with words as the analyzer and an n-gram range of 1-2, there are almost 6 million features, so I have to apply singular value decomposition (SVD) to reduce the number of features.
I have tried gensim's and sklearn's SVD implementations. Both work fine for reducing to 100 features, but as soon as I try 200 features they throw a memory error.
Also, I have not used the entire corpus (600,000 documents) as training data; I have taken only 50,000 documents. So essentially my training matrix is 50,000 × 6 million, and I want to reduce it to 50,000 × (100 to 500).
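Roughly what the setup looks like (a sketch of what I described above, assuming sklearn's TfidfVectorizer and TruncatedSVD; docs stands in for my 50,000 training paragraphs):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# docs: list of 50,000 paragraph strings (placeholder)
vectorizer = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
X = vectorizer.fit_transform(docs)      # sparse matrix, shape ~ (50000, 6000000)

svd = TruncatedSVD(n_components=200)    # works with 100 components, MemoryError at 200
X_reduced = svd.fit_transform(X)
```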
Is there any other way I can implement this in Python, or do I have to use Spark MLlib's SVD (written only for Java and Scala)? If yes, how much faster will it be?
System specification: 32 GB RAM with a 4-core processor, on Ubuntu 14.04.
I don't really see why using Spark MLlib's SVD would improve performance or avoid the memory errors. You simply exceed the size of your RAM: for example, with 200 components over ~6 million features, the right singular vectors alone form a dense 200 × 6,000,000 float64 array, roughly 9.6 GB, before counting the intermediate buffers of the decomposition. You have some options to deal with that; one of them is sketched below.
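For instance, you can prune the vocabulary at vectorization time and keep everything sparse until the reduction step. A minimal sketch with sklearn (the cut-off values are illustrative, not tuned):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

# Cap the vocabulary instead of keeping all ~6 million n-grams.
vectorizer = TfidfVectorizer(
    analyzer="word",
    ngram_range=(1, 2),
    min_df=5,            # drop n-grams that occur in fewer than 5 documents
    max_features=200000, # hard cap on the vocabulary size (illustrative value)
)
X = vectorizer.fit_transform(docs)  # stays a scipy sparse matrix

# Randomized truncated SVD operates on the sparse matrix directly;
# its memory footprint grows with n_components * n_features.
svd = TruncatedSVD(n_components=200, algorithm="randomized", n_iter=5)
X_reduced = svd.fit_transform(X)    # dense, shape (50000, 200)
```

Whether 200,000 features is enough for your task depends on your data; the point is that shrinking n_features before the SVD shrinks the dense component matrix that TruncatedSVD has to hold in memory.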
Also, you should show your code sample; you might be doing something wrong in your Python code.