
SVD using Scikit-Learn and Gensim with 6 million features

I am trying to classify paragraphs based on their sentiment. I have training data of 600 thousand documents. When I convert them to a tf-idf vector space with a word analyzer and an n-gram range of 1-2, there are almost 6 million features, so I have to do singular value decomposition (SVD) to reduce the number of features.
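Roughly, the vectorization step looks like this (a simplified sketch; `docs` stands for my list of raw paragraph strings and the exact parameters may differ slightly):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # docs is assumed to be a list of raw paragraph strings (the training documents)
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))

    # Produces a sparse matrix of shape (n_documents, ~6 million features)
    X_tfidf = vectorizer.fit_transform(docs)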

I have tried both Gensim's and scikit-learn's SVD implementations. Both work fine when reducing to 100 features, but as soon as I try 200 features they throw a memory error.

Also, I have not used the entire corpus (600 thousand documents) as training data; I have taken only 50,000 documents. So essentially my training matrix is 50,000 * 6 million, and I want to reduce it to 50,000 * (100 to 500).
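The reduction step I am attempting is essentially the following (a sketch; this is where the memory error appears once `n_components` goes above roughly 100):

    from sklearn.decomposition import TruncatedSVD

    # X_tfidf is the sparse 50,000 x ~6 million tf-idf matrix from the step above.
    # Works with n_components=100, but raises a MemoryError around 200 on my machine.
    svd = TruncatedSVD(n_components=200, algorithm='randomized')
    X_reduced = svd.fit_transform(X_tfidf)  # shape: (50000, 200)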

Is there any other way I can implement this in Python, or do I have to use Spark MLlib's SVD (written only for Java and Scala)? If yes, how much faster will it be?

System specification: 32 GB RAM and a 4-core processor on Ubuntu 14.04.

I don't really see why using Spark MLlib's SVD would improve performance or avoid memory errors. You are simply exceeding the size of your RAM. You have some options to deal with that:

  • Reduce the dictionary size of your tf-idf (for example, by playing with the max_df and min_df parameters in scikit-learn); see the sketch after this list.
  • Use a hashing vectorizer instead of tf-idf, also shown in the sketch below.
  • Get more RAM (but at some point tf-idf + SVD is not scalable).
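For the first two points, a rough sketch of what that could look like (the parameter values are only illustrative and `docs` stands for your raw documents):

    from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

    # Option 1: shrink the tf-idf vocabulary by dropping very rare and very common terms.
    # The min_df/max_df values below are illustrative; tune them on your data.
    pruned_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2),
                                   min_df=5, max_df=0.5)

    # Option 2: hash n-grams into a fixed number of columns instead of building a vocabulary.
    hashing = HashingVectorizer(analyzer='word', ngram_range=(1, 2),
                                n_features=2**18)

    X_pruned = pruned_tfidf.fit_transform(docs)  # vocabulary-based, but much smaller
    X_hashed = hashing.fit_transform(docs)       # fixed 262144 columns, no vocabulary kept

Either of these should give TruncatedSVD a much narrower sparse matrix to work with.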

Also, you should show your code sample; you might be doing something wrong in your Python code.
