
SVD using Scikit-Learn and Gensim with 6 million features

I am trying to classify paragraphs based on their sentiment. I have training data of 600 thousand documents. When I convert them to a tf-idf vector space with a word analyzer and an n-gram range of 1-2, there are almost 6 million features, so I have to do singular value decomposition (SVD) to reduce the number of features.
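Roughly, the vectorization step looks like this (a simplified sketch; `docs` stands for my list of raw paragraph strings and the exact parameters may differ slightly):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # docs is assumed to be a list of raw paragraph strings (the training documents)
    vectorizer = TfidfVectorizer(analyzer='word', ngram_range=(1, 2))

    # Produces a sparse matrix of shape (n_documents, ~6 million features)
    X_tfidf = vectorizer.fit_transform(docs)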

I have tried both Gensim's and scikit-learn's SVD implementations. Both work fine when reducing to 100 features, but as soon as I try 200 features they throw a memory error.

Also, I have not used the entire corpus (600 thousand documents) as training data; I have taken only 50,000 documents. So essentially my training matrix is 50,000 * 6 million, and I want to reduce it to 50,000 * (100 to 500).
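The reduction step I am attempting is essentially the following (a sketch; this is where the memory error appears once `n_components` goes above roughly 100):

    from sklearn.decomposition import TruncatedSVD

    # X_tfidf is the sparse 50,000 x ~6 million tf-idf matrix from the step above.
    # Works with n_components=100, but raises a MemoryError around 200 on my machine.
    svd = TruncatedSVD(n_components=200, algorithm='randomized')
    X_reduced = svd.fit_transform(X_tfidf)  # shape: (50000, 200)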

Is there any other way I can implement this in Python, or do I have to use Spark MLlib's SVD (written only for Java and Scala)? If yes, how much faster will it be?

System specification: 32 GB RAM and a 4-core processor on Ubuntu 14.04.

I don't really see why using Spark MLlib's SVD would improve performance or avoid memory errors. You are simply exceeding the size of your RAM. You have some options to deal with that:

  • Reduce the dictionary size of your tf-idf (for example, by playing with the max_df and min_df parameters in scikit-learn); see the sketch after this list.
  • Use a hashing vectorizer instead of tf-idf, also shown in the sketch below.
  • Get more RAM (but at some point tf-idf + SVD is not scalable).
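For the first two points, a rough sketch of what that could look like (the parameter values are only illustrative and `docs` stands for your raw documents):

    from sklearn.feature_extraction.text import TfidfVectorizer, HashingVectorizer

    # Option 1: shrink the tf-idf vocabulary by dropping very rare and very common terms.
    # The min_df/max_df values below are illustrative; tune them on your data.
    pruned_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 2),
                                   min_df=5, max_df=0.5)

    # Option 2: hash n-grams into a fixed number of columns instead of building a vocabulary.
    hashing = HashingVectorizer(analyzer='word', ngram_range=(1, 2),
                                n_features=2**18)

    X_pruned = pruned_tfidf.fit_transform(docs)  # vocabulary-based, but much smaller
    X_hashed = hashing.fit_transform(docs)       # fixed 262144 columns, no vocabulary kept

Either of these should give TruncatedSVD a much narrower sparse matrix to work with.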

Also, you should show your code sample; you might be doing something wrong in your Python code.
