Scikit-learn MemoryError with RandomForestClassifier

I am following along with the tutorial here: https://blog.hyperiondev.com/index.php/2019/02/18/machine-learning/

I have the exact same code the author uses, but I will share it below anyway:

import scipy.io
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.utils import shuffle

# Load the .mat training set: X is (32, 32, 3, 73257), y is (73257, 1)
train_data = scipy.io.loadmat('train_32x32.mat')
X = train_data['X']
y = train_data['y']

img_index = 24  # left over from the tutorial; unused in this snippet

# Flatten each 32x32x3 image into one row of 3072 pixel values
X = X.reshape(X.shape[0]*X.shape[1]*X.shape[2], X.shape[3]).T
y = y.reshape(y.shape[0],)
X, y = shuffle(X, y, random_state=42)

clf = RandomForestClassifier(n_estimators=10, n_jobs=1)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)  # <-----------(MEMORY ERROR)

preds = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, preds))

The dataset I am using is basically a dictionary of digit labels and pictures of digits. Every time I get to the line I pointed out above, I receive a MemoryError. The full error traceback is below:

Traceback (most recent call last):
  File "C:/Users/jack.walsh/Projects/img_recog/main.py", line 22, in <module>
    clf.fit(X_train, y_train)
  File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\ensemble\forest.py", line 249, in fit
    X = check_array(X, accept_sparse="csc", dtype=DTYPE)
  File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\sklearn\utils\validation.py", line 496, in check_array
    array = np.asarray(array, dtype=dtype, order=order)
  File "C:\Users\jack.walsh\AppData\Local\Programs\Python\Python37-32\lib\site-packages\numpy\core\numeric.py", line 538, in asarray
    return array(a, dtype, copy=False, order=order)
MemoryError

I ran Resource Monitor side by side with the script and saw that used memory never goes above 30%. Let me know how I can get around this without altering the results!

X.shape = (73257, 3072)

X_train.shape = (51279, 3072)
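
For reference, here is what the raw arrays look like straight out of loadmat (a quick interpreter check, consistent with the reshape in the code above):

>>> import scipy.io
>>> train_data = scipy.io.loadmat('train_32x32.mat')
>>> train_data['X'].shape   # 32x32 RGB images stacked along the last axis
(32, 32, 3, 73257)
>>> train_data['y'].shape   # one label per image
(73257, 1)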

I have 16GB RAM on this machine.

Given that your dataset has 3072 columns (reasonable for images), I think it is simply too much for a random forest, especially when you have no regularization applied to the classifier. The machine simply doesn't have enough memory to allocate for such a gigantic model.
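
To put numbers on that: the traceback shows fit() dying inside check_array, where scikit-learn converts your (presumably uint8) pixel array into a fresh float32 copy before growing any trees. Note also that the paths in the traceback say Python37-32, i.e. a 32-bit interpreter, which can only address roughly 2-4 GB regardless of the 16 GB installed; that would explain why Resource Monitor never shows usage above 30%. A rough back-of-the-envelope with the shapes you quoted:

import numpy as np

# Size of the float32 copy that check_array(X, dtype=DTYPE) allocates,
# using the X_train shape quoted above.
n_samples, n_features = 51279, 3072
bytes_needed = n_samples * n_features * 4      # 4 bytes per float32 value
print(f"{bytes_needed / 2**30:.2f} GiB")       # -> 0.59 GiB for the copy alone

0.6 GiB would be trivial for a 64-bit process on this machine, but a 32-bit process that already holds the original array, the shuffled copy, and the train/test split can easily run out of contiguous address space at that point.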

Something that I would do in this situation:

  1. Reduce the number of features before training. This is difficult to do directly, since your data is images and each column is just a pixel value, but you could resize the images to be smaller before flattening them (see the sketch after this list).

  2. Add regularization to your random forest classifier: for example, set max_depth to something smaller, or set max_features so that not all 3072 features are considered at every split. Here's the full list of parameters that you can tune: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html

  3. According to Scikit Learn RandomForest Memory Error, setting n_jobs=1 might help as well (your snippet already does this).

  4. Lastly, I personally would not use a random forest for image classification. I would choose a classifier like an SVM, or go deep with a deep learning model.
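
Putting suggestions 1-3 together, here is a minimal sketch of what that could look like. The stride-based downsample and the max_depth=12 value are illustrative choices, not tuned settings; a proper resize (e.g. skimage.transform.resize) would preserve more detail:

import numpy as np
import scipy.io
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load as in the question: X arrives as (32, 32, 3, n_images)
train_data = scipy.io.loadmat('train_32x32.mat')
X_raw = train_data['X']
y = train_data['y'].reshape(-1)
n_images = X_raw.shape[3]

# Suggestion 1: downsample 32x32 -> 16x16 before flattening, cutting the
# feature count from 3072 to 768. Taking every second pixel is the crudest
# possible resize, used here only to keep the sketch dependency-free.
X_small = X_raw[::2, ::2, :, :]
X = X_small.reshape(16 * 16 * 3, n_images).T

# Converting to float32 up front means fit() will not need to allocate a
# second, converted copy of the whole matrix.
X = X.astype(np.float32)

# Suggestions 2 and 3: a regularized forest on a single process.
clf = RandomForestClassifier(
    n_estimators=10,
    max_depth=12,          # cap tree depth instead of growing to purity
    max_features='sqrt',   # ~28 of the 768 candidate features per split
    n_jobs=1,              # no parallel workers holding extra copies
    random_state=42,
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
clf.fit(X_train, y_train)

Halving the resolution quarters the number of columns, and capping max_depth bounds how many nodes each tree can allocate, so both the input copy and the fitted model shrink substantially.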
