
Sklearn K-Fold Cross Validation Memory Issues

I'm trying to run some supervised experiments with a simple text classifier, but I'm running into memory issues when using the K-Fold generator in scikit-learn. The error I'm getting states: "Your system has run out of application memory", but my dataset is only ~245K rows x ~81K columns. Large-ish, sure, but not huge. The program never terminates, but rather "hangs" until I manually shut down the terminal app. I've let it run like this for about 30 minutes with no progress.

I've also added print statements to see where in the cross-validation for-loop the code gets stuck. It looks like the indices for the training and test sets are generated, but the code never gets to the point of slicing off the actual training and test sets for features and labels using those indices. I'm running this on a MacBook Pro running OS X 10.9.5, and I've run it after shutting down every other app except Terminal, with no success. Has anyone else had problems with this, or is this likely something specific to my machine?

Edit: I've run this with 10-fold and 5-fold cross-validation and run into the same problems each time.

I think the first issue comes from this part:

my dataset is only ~245K rows x ~81K columns. Large-ish, sure, but not huge.

245K x 80K does not sound huge, but let's just do the math and assume 8 bytes per stored element. If your matrix is not sparse (obviously in your case it is a sparse matrix), that would be 245 * 80 * 8 MB, so roughly 160 GB that needs to be stored in your RAM. This is actually huge!
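To make that arithmetic concrete, here is a minimal sketch; the figure of 100 non-zeros per document is a hypothetical density chosen purely for illustration:

    # Back-of-the-envelope memory check for a dense float64 matrix
    n_rows, n_cols = 245_000, 81_000
    bytes_per_element = 8  # float64

    dense_bytes = n_rows * n_cols * bytes_per_element
    print(f"Dense storage: {dense_bytes / 1e9:.0f} GB")  # ~159 GB

    # A sparse CSR matrix only stores the non-zero entries. Assuming a
    # hypothetical 100 non-zero features per document, CSR costs roughly
    # 8 B (value) + 4 B (column index) per non-zero, plus one 4 B row
    # pointer per row.
    nnz = n_rows * 100
    sparse_bytes = nnz * (8 + 4) + (n_rows + 1) * 4
    print(f"Sparse (CSR) storage: {sparse_bytes / 1e9:.2f} GB")  # ~0.3 GB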

You mention text classification, so I'm guessing your features are tf-idf or counts of words, and that the matrix is very sparse. What you need to be careful about now is keeping the sparsity at each step and using only algorithms that work with sparse data and will not allocate a dense matrix of size n_samples * n_features.
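As a sketch of what "keeping the sparsity" means in practice (a toy corpus, assuming tf-idf features):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # TfidfVectorizer returns a scipy.sparse CSR matrix, so the full
    # n_samples x n_features matrix is never materialized densely.
    docs = ["the cat sat on the mat", "the dog ate my homework"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    print(type(X))         # a scipy.sparse CSR matrix
    print(X.shape, X.nnz)  # shape and number of stored non-zeros

    # Danger zone: calling X.toarray() (or passing X to anything that
    # densifies it) allocates the full dense matrix and is what
    # typically exhausts memory at this scale.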

Naive Bayes classifiers (see sklearn.naive_bayes.MultinomialNB for instance) have had decent success in text classification; I would start there.

Such a classifier can easily handle a 250K x 80K matrix, as long as it is a sparse matrix (and is actually sparse enough, of course).
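Here is a minimal sketch of a cross-validation loop that stays sparse end-to-end, using the current KFold from sklearn.model_selection and toy random data standing in for the real features:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import MultinomialNB

    # Toy sparse data standing in for a large tf-idf matrix:
    # ~10% of entries are non-zero, all values non-negative.
    rng = np.random.default_rng(0)
    X = csr_matrix(rng.random((100, 50)) * (rng.random((100, 50)) < 0.1))
    y = rng.integers(0, 2, size=100)

    # Row-slicing a CSR matrix with an index array keeps it sparse,
    # so each fold only costs memory proportional to its non-zeros.
    for train_idx, test_idx in KFold(n_splits=5).split(X):
        clf = MultinomialNB().fit(X[train_idx], y[train_idx])
        print(clf.score(X[test_idx], y[test_idx]))

The key point is that X[train_idx] on a CSR matrix returns another sparse matrix; if your slicing step instead converts to a dense array, that single line can try to allocate the ~160 GB computed above.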

If you still want to reduce the number of features you get from tf-idf you have several options:

  1. Remove stop words, either with a stop-word list or by setting the max_df parameter to a value around 0.7 or lower (this will discard any term that appears in more than 70% of the documents).
  2. Apply feature selection before training your classifier. This scikit-learn example shows how to use the chi-squared statistic to select features from sparse data.
  3. Apply dimensionality reduction techniques such as SVD (I'd look into latent semantic indexing, but I am not proficient with this).

Options 1. and 2. combined should already allow you to significantly reduce the number of features.
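For instance, both options can be chained in a single pipeline; this is only a sketch on a toy corpus, and k=3 is an arbitrary value chosen because the toy vocabulary is tiny (tune k on your real data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline

    docs = ["the cat sat on the mat", "the dog ate my homework",
            "cats and dogs", "my homework is late"]
    y = [0, 1, 0, 1]  # toy labels

    # Option 1: max_df discards terms present in >70% of documents.
    # Option 2: chi-squared selection keeps the k most class-predictive
    # terms; chi2 accepts sparse, non-negative input like tf-idf.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(max_df=0.7)),
        ("chi2", SelectKBest(chi2, k=3)),
    ])

    X_reduced = pipeline.fit_transform(docs, y)  # stays a sparse matrix
    print(X_reduced.shape)

Option 3 would correspond to sklearn.decomposition.TruncatedSVD, which also accepts sparse input directly.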

Let me know if that helps.
