
Sklearn random forest model is too big

A question from a beginner in sklearn; please advise. I have a RandomForestClassifier model trained with the following parameters:

n_estimators = 32,
criterion = 'gini',
max_depth = 380,

These parameters were not chosen at random; they gave the best performance, although they seem strange to me.

The model is about 5.5 GB when saved with joblib.dump and compress=3.
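One way to see where that size comes from is to count the nodes stored in the forest and compare with the size of the dumped file. The following is only a minimal sketch on synthetic data (make_classification stands in for the real TF-IDF matrix, and the file name rf.joblib is made up for illustration), so the numbers it prints say nothing about the 5.5 GB model itself:

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption: NOT the TF-IDF features from the question).
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=10, random_state=0
)

# Same hyperparameters as in the question.
clf = RandomForestClassifier(
    n_estimators=32, criterion="gini", max_depth=380, random_state=0
)
clf.fit(X, y)

# Every node of every tree is stored in the model, so the node count
# is a rough proxy for the size on disk.
total_nodes = sum(est.tree_.node_count for est in clf.estimators_)
deepest = max(est.tree_.max_depth for est in clf.estimators_)

# Dump with the same compression level and measure the file.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "rf.joblib")  # hypothetical file name
    joblib.dump(clf, path, compress=3)
    size_bytes = os.path.getsize(path)

print(f"total nodes: {total_nodes}, deepest tree: {deepest}, file: {size_bytes} bytes")
```

Running the same node count on the real model would show whether the trees actually reach anything close to depth 380 or stop much earlier.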

The data used is:

tfidf=TfidfVectorizer()
X_train=tfidf.fit_transform(X_train)

and

le=LabelEncoder()
# fit_transform already fits the encoder, so a separate le.fit call is not needed
y_train=le.fit_transform(y_train)

with a sample of 4.7 million records, split 70% train / 30% test.

Now I have some questions that maybe someone can help with:

Do the parameters used for the model, and the size of the model, make sense given the size of the sample? The choice of parameters is probably not optimal and inflates the model size (I do understand that the main parameter increasing the size here is max_depth, but it gave the best result...)

Are there any suggestions on the parameters or on data preparation in general? In my experience with this sample, I noticed the following:

1. Increasing n_estimators makes almost no difference to the outcome.
2. Increasing max_depth, on the other hand, brings significant improvements. For example:
   - max_depth = 10: accuracy_score of 0.3
   - max_depth = 380: accuracy_score of 0.95

Any suggestions or advice are very welcome! :)

UPD. Accuracy results

Train score: 0.988 (classifier.score)

OOB score: 0.953 (classifier.oob_score_)

Test score: 0.935 (sklearn.metrics accuracy_score)

Try using min_samples_leaf instead of max_depth to limit tree depth. This allows different depths along different paths of a tree, and across different estimators, which hopefully makes it possible to find a model that performs well with a lower average depth. I like to set min_samples_leaf as a float, which is interpreted as a fraction of the number of samples. Try a grid search over values between 0.0001 and 0.1.
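A minimal sketch of such a grid search follows. It assumes a small synthetic dataset from make_classification rather than the TF-IDF data in the question, and the four grid values, n_estimators=10 and cv=3 are only illustrative choices to keep it quick. It also counts tree nodes for each setting, since the node count is what drives the saved model size:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption: NOT the data from the question).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=10, n_classes=3, random_state=0
)

# Float values are read as a fraction of the training samples.
param_grid = {"min_samples_leaf": [0.0001, 0.001, 0.01, 0.1]}

search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)


def total_nodes(forest):
    """Sum the node counts of all trees in a fitted forest."""
    return sum(est.tree_.node_count for est in forest.estimators_)


# Refit once per setting on the full data to compare model sizes.
node_counts = {}
for frac in param_grid["min_samples_leaf"]:
    forest = RandomForestClassifier(
        n_estimators=10, min_samples_leaf=frac, random_state=0
    ).fit(X, y)
    node_counts[frac] = total_nodes(forest)

print("best params:", search.best_params_)
print("mean CV accuracy per setting:", search.cv_results_["mean_test_score"])
print("total nodes per setting:", node_counts)
```

Comparing the accuracy and node-count lines side by side shows how much accuracy each reduction in model size costs; on the real data the trade-off may look quite different.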
