
Sklearn K-Fold Cross Validation Memory Issues

I'm trying to run some supervised experiments with a simple text classifier, but I'm running into memory issues when using the K-Fold generator in scikit-learn. The error I'm getting states: "Your system has run out of application memory", but my dataset is only ~245K rows x ~81K columns. Large-ish, sure, but not huge. The program never terminates, but rather "hangs" until I manually shut down the terminal app. I've let it run like this for about 30 minutes with no progress.

I've also added print statements to see where in the cross-validation for-loop the code gets stuck. It looks like the indices for the training and test sets are generated, but the code never gets to the point of slicing off the actual training and test sets for features and labels using those indices. I'm running this on a MacBook Pro running OS X 10.9.5, and I've run it after shutting down every other app except Terminal, with no success. Has anyone else had problems with this, or is this likely something specific to my machine?

Edit: I've run this with 10-fold and 5-fold cross-validation and run into the same problems each time.

I think the first issue comes from this part:

my dataset is only ~245K rows x ~81K columns. Large-ish, sure, but not huge.

245K x 80K does not sound huge, but let's just do the math and assume 8 bytes per stored element. If your matrix is not sparse (obviously in your case it is a sparse matrix), that would be 245 * 80 * 8 MB, so roughly 160 GB that needs to be stored in your RAM. This is actually huge!
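To make that arithmetic concrete, here is a minimal sketch; the figure of 100 non-zeros per document is a hypothetical density chosen purely for illustration:

    # Back-of-the-envelope memory check for a dense float64 matrix
    n_rows, n_cols = 245_000, 81_000
    bytes_per_element = 8  # float64

    dense_bytes = n_rows * n_cols * bytes_per_element
    print(f"Dense storage: {dense_bytes / 1e9:.0f} GB")  # ~159 GB

    # A sparse CSR matrix only stores the non-zero entries. Assuming a
    # hypothetical 100 non-zero features per document, CSR costs roughly
    # 8 B (value) + 4 B (column index) per non-zero, plus one 4 B row
    # pointer per row.
    nnz = n_rows * 100
    sparse_bytes = nnz * (8 + 4) + (n_rows + 1) * 4
    print(f"Sparse (CSR) storage: {sparse_bytes / 1e9:.2f} GB")  # ~0.3 GB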

You mention text classification, so I'm guessing your features are tf-idf or counts of words, and that the matrix is very sparse. What you need to be careful about now is keeping the sparsity at each step and using only algorithms that work with sparse data and will not allocate a dense matrix of size n_samples * n_features.
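As a sketch of what "keeping the sparsity" means in practice (a toy corpus, assuming tf-idf features):

    from sklearn.feature_extraction.text import TfidfVectorizer

    # TfidfVectorizer returns a scipy.sparse CSR matrix, so the full
    # n_samples x n_features matrix is never materialized densely.
    docs = ["the cat sat on the mat", "the dog ate my homework"]
    vectorizer = TfidfVectorizer()
    X = vectorizer.fit_transform(docs)

    print(type(X))         # a scipy.sparse CSR matrix
    print(X.shape, X.nnz)  # shape and number of stored non-zeros

    # Danger zone: calling X.toarray() (or passing X to anything that
    # densifies it) allocates the full dense matrix and is what
    # typically exhausts memory at this scale.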

Naive Bayes classifiers (see sklearn.naive_bayes.MultinomialNB for instance) have had decent success in text classification; I would start there.

Such a classifier can easily handle a 250K x 80K matrix, as long as it is a sparse matrix (and is actually sparse enough, of course).
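Here is a minimal sketch of a cross-validation loop that stays sparse end-to-end, using the current KFold from sklearn.model_selection and toy random data standing in for the real features:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.model_selection import KFold
    from sklearn.naive_bayes import MultinomialNB

    # Toy sparse data standing in for a large tf-idf matrix:
    # ~10% of entries are non-zero, all values non-negative.
    rng = np.random.default_rng(0)
    X = csr_matrix(rng.random((100, 50)) * (rng.random((100, 50)) < 0.1))
    y = rng.integers(0, 2, size=100)

    # Row-slicing a CSR matrix with an index array keeps it sparse,
    # so each fold only costs memory proportional to its non-zeros.
    for train_idx, test_idx in KFold(n_splits=5).split(X):
        clf = MultinomialNB().fit(X[train_idx], y[train_idx])
        print(clf.score(X[test_idx], y[test_idx]))

The key point is that X[train_idx] on a CSR matrix returns another sparse matrix; if your slicing step instead converts to a dense array, that single line can try to allocate the ~160 GB computed above.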

If you still want to reduce the number of features you get from tf-idf you have several options:

  1. Remove stop words, either with a stop-word list or by setting the max_df parameter to a value around 0.7 or lower (this will discard any term that appears in more than 70% of the documents).
  2. Apply feature selection before training your classifier. This scikit-learn example shows how to use the chi-squared statistic to select features from sparse data.
  3. Apply dimensionality reduction techniques such as SVD (I'd look into latent semantic indexing, but I am not proficient with this).

Options 1. and 2. combined should already allow you to significantly reduce the number of features.
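For instance, both options can be chained in a single pipeline; this is only a sketch on a toy corpus, and k=3 is an arbitrary value chosen because the toy vocabulary is tiny (tune k on your real data):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.feature_selection import SelectKBest, chi2
    from sklearn.pipeline import Pipeline

    docs = ["the cat sat on the mat", "the dog ate my homework",
            "cats and dogs", "my homework is late"]
    y = [0, 1, 0, 1]  # toy labels

    # Option 1: max_df discards terms present in >70% of documents.
    # Option 2: chi-squared selection keeps the k most class-predictive
    # terms; chi2 accepts sparse, non-negative input like tf-idf.
    pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(max_df=0.7)),
        ("chi2", SelectKBest(chi2, k=3)),
    ])

    X_reduced = pipeline.fit_transform(docs, y)  # stays a sparse matrix
    print(X_reduced.shape)

Option 3 would correspond to sklearn.decomposition.TruncatedSVD, which also accepts sparse input directly.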

Let me know if that helps.
