
Scikit-learn Random Forest taking up too much memory

Problem

I have a dataset with 900,000 rows and 8 columns. Six of the columns are integers and the other two are floats. When I fit on about a quarter of the dataset (200,000 rows), the code runs fine and finishes in under 30 seconds. When I try to fit 400,000 rows or more, my computer freezes permanently because the python.exe process takes up over 5 GB of RAM.

Attempts

The first thing I tried was setting the warm_start parameter to True and then going through the data 50,000 rows at a time:

import sklearn.ensemble as sk  # 'sk' refers to scikit-learn's ensemble module in these snippets

n = 0
i = 50000
clf = sk.RandomForestClassifier(oob_score=True, n_jobs=-1, n_estimators=10, warm_start=True)
while i <= 850000:
    clf.fit(X.iloc[n:i], Y.iloc[n:i])
    clf.n_estimators += 10  # warm_start only adds new trees if n_estimators grows between fits
    n += 50000
    i += 50000

This didn't solve anything; I ran into the same issue.

The next thing I tried was to find out whether some part of the data was taking much more memory to process. I recorded the memory increase of the python.exe process and the time it took for the fit to complete, if it completed at all.

n = 50
clf = sk.RandomForestClassifier(oob_score=True, n_jobs=-1, n_estimators=n, warm_start=True)
# Each of the following subsets was tried separately with the %time call below:
Z = X[['DayOfWeek','PdDistrict','Year','Day','Month']]  # takes 15 s and ~600 MB additional RAM (800 MB total)
Z = X[['X','Address','Y']]                              # takes 24.8 s and ~1.1 GB additional RAM (1389 MB total)
Z = X                                                   # never finishes, peaks at 5.2 GB
%time clf.fit(Z.iloc[0:400000], Y.iloc[0:400000])

While some columns take longer to process than others, none of them accounts for 5 GB of memory being used.

The data itself is only a few megabytes in size, so I don't see how it can take so much memory to process.
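One quick way to back up that claim is pandas' own accounting of the frame's size; this is a minimal sketch, assuming X is the DataFrame used in the snippets above:

# Report the in-memory size of the raw data, including object (string) columns
raw_mb = X.memory_usage(deep=True).sum() / 1e6
print(f"raw data: ~{raw_mb:.1f} MB")  # a small DataFrame can still produce a multi-GB forest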

The model you are building just gets too big. Get more RAM or build a smaller model. To build a smaller model, either create fewer trees or limit the depth of each tree, say by using max_depth. Try max_depth=5 and see what happens. Also, how many classes do you have? More classes make everything more expensive.
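A minimal sketch of what that advice looks like in code, using the max_depth=5 value suggested above; the tree count of 50 is just an illustrative choice, not a tuned value:

from sklearn.ensemble import RandomForestClassifier

# A deliberately smaller model: a modest number of trees and a hard cap on tree depth
clf = RandomForestClassifier(
    n_estimators=50,   # a modest number of trees
    max_depth=5,       # shallow trees keep the fitted forest small
    n_jobs=-1,
    oob_score=True,
)
clf.fit(X.iloc[0:400000], Y.iloc[0:400000])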

Also, you might want to try this: https://github.com/scikit-learn/scikit-learn/pull/4783

I ran into a similar situation with a too-large Random Forest model. The problem was that the trees were too deep and took a lot of memory. To deal with it, I set max_depth=6 and it reduced the memory use. I even wrote about it in a blog post. In the article I was using a 32k-row dataset with 15 columns; setting max_depth=6 decreased memory consumption 66 times while keeping similar performance (in the article the performance even increased).
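One way to see this effect on your own data is to compare the node count and serialized size of an unconstrained forest against a depth-limited one. This is a sketch under the assumption that X_train and y_train are whatever training set you have at hand (hypothetical names):

import pickle
from sklearn.ensemble import RandomForestClassifier

# Compare the footprint of an unconstrained forest vs. a depth-limited one
for depth in (None, 6):
    clf = RandomForestClassifier(n_estimators=100, max_depth=depth, n_jobs=-1)
    clf.fit(X_train, y_train)  # X_train, y_train: placeholder training data
    n_nodes = sum(tree.tree_.node_count for tree in clf.estimators_)
    size_mb = len(pickle.dumps(clf)) / 1e6
    print(f"max_depth={depth}: {n_nodes} nodes, ~{size_mb:.1f} MB pickled")

The pickled size is only a proxy for peak RAM during fitting, but the difference in total node count makes it clear why capping depth shrinks the model so dramatically.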
