
Python sklearn RandomForestClassifier non-reproducible results

I've been using sklearn's random forest to compare several models, and I noticed that it gives different results even with the same seed. I tried it both ways: calling random.seed(1234), and using the random forest's built-in random_state=1234. In both cases I get non-repeatable results. What have I missed?

# 1
RandomForestClassifier(max_depth=5, max_features=5, criterion='gini', min_samples_leaf = 10)
# or 2
RandomForestClassifier(max_depth=5, max_features=5, criterion='gini', min_samples_leaf = 10, random_state=1234)

Any ideas? Thanks!!

EDIT: Adding a more complete version of my code

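import numpy as np
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train, X_test, y_test and seed are defined elsewhere
clf = RandomForestClassifier(max_depth=60, max_features=60,
                             criterion='entropy',
                             min_samples_leaf=3, random_state=seed)
# As described, I tried setting random_state in several ways; the results still differ
clf = clf.fit(X_train, y_train)

predicted = clf.predict(X_test)
# Probability of the positive class, used to build the ROC curve
predicted_prob = clf.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = metrics.roc_curve(np.array(y_test), predicted_prob)
auc = metrics.auc(fpr, tpr)
print(auc)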

EDIT: It's been quite a while, but I think using RandomState might solve the problem. I haven't tested it myself yet, but if you're reading this, it's worth a shot. Also, it is generally preferable to use RandomState instead of random.seed().

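First, make sure that you have the latest versions of the required modules (e.g. scipy, numpy). Note that random.seed(1234) seeds the NumPy generator only when the name random refers to numpy.random (for example after from numpy import *). Python's built-in random module is a separate generator, and seeding it does not affect NumPy.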
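Which generator random.seed(1234) touches depends on what the name random refers to. The following is a minimal sketch, assuming only that NumPy is installed, that seeding the standard-library random module leaves NumPy's global generator untouched, while np.random.seed does control it. The variable names are made up for illustration.

```python
import random  # Python's standard-library generator

import numpy as np

# Seed the standard-library generator, then draw from NumPy's global generator.
random.seed(1234)
a = np.random.rand(3)
random.seed(1234)
b = np.random.rand(3)
# Expected to be False: random.seed() does not reset NumPy's generator.
stdlib_seed_controls_numpy = np.array_equal(a, b)

# Seed NumPy's global generator itself and repeat the draw.
np.random.seed(1234)
c = np.random.rand(3)
np.random.seed(1234)
d = np.random.rand(3)
# Expected to be True: the same seed replays the same stream.
numpy_seed_controls_numpy = np.array_equal(c, d)

print(stdlib_seed_controls_numpy, numpy_seed_controls_numpy)
```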

When you use the random_state parameter of RandomForestClassifier, there are several options: int, RandomState instance, or None.

From the docs:

  • If int, random_state is the seed used by the random number generator;

  • If RandomState instance, random_state is the random number generator;

  • If None, the random number generator is the RandomState instance used by np.random.
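The three options quoted above can be tried side by side. This is only a sketch on a small synthetic dataset, not the asker's data; the helper fit_proba and the n_estimators value are choices made for this illustration. It fits the same forest twice per option and checks whether the predicted probabilities match.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)


def fit_proba(random_state):
    """Fit a small forest (hypothetical helper) and return its class probabilities."""
    clf = RandomForestClassifier(n_estimators=10, max_depth=2,
                                 random_state=random_state)
    return clf.fit(X, y).predict_proba(X)


# int: the value is used as the seed, so the same int is passed both times.
same_with_int = np.array_equal(fit_proba(1234), fit_proba(1234))

# RandomState instance: a fresh, identically seeded instance for each fit.
same_with_instance = np.array_equal(fit_proba(np.random.RandomState(1234)),
                                    fit_proba(np.random.RandomState(1234)))

# None: the global np.random generator is used, so reseed it before each fit.
np.random.seed(1234)
first = fit_proba(None)
np.random.seed(1234)
second = fit_proba(None)
same_with_none = np.array_equal(first, second)

print(same_with_int, same_with_instance, same_with_none)
```

With random_state=None, reproducibility depends on nothing else drawing from the global generator between the seed call and the fit, which is one reason passing an int or a RandomState instance is usually easier to reason about.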

One way to use the same generator in both cases is shown below. Both classifiers draw from the same (NumPy) generator, and the two give the same results.

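import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, n_features=4,
                           n_informative=2, n_redundant=0,
                           random_state=0, shuffle=False)

# Seed NumPy's global generator before the first fit as well;
# random_state defaults to None, so the forest draws from that generator
np.random.seed(1234)
clf = RandomForestClassifier(max_depth=2)
clf.fit(X, y)

# np.random.seed() returns None, so random_state is None here too;
# the call simply reseeds the global generator before the second fit
clf2 = RandomForestClassifier(max_depth=2, random_state=np.random.seed(1234))
clf2.fit(X, y)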

Check if the results are the same:

all(clf.predict(X) == clf2.predict(X))

Check again by running the same code 5 times:

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

for i in range(5):
    X, y = make_classification(n_samples=1000, n_features=4,
                               n_informative=2, n_redundant=0,
                               random_state=0, shuffle=False)

    # Seed NumPy's global generator before the first fit as well
    np.random.seed(1234)
    clf = RandomForestClassifier(max_depth=2)
    clf.fit(X, y)

    # np.random.seed() returns None; the call reseeds the global generator
    clf2 = RandomForestClassifier(max_depth=2, random_state=np.random.seed(1234))
    clf2.fit(X, y)

    print(all(clf.predict(X) == clf2.predict(X)))



OK, what eventually solved it was reinstalling the conda environment. I'm still not sure why the results differed. Thanks.
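For anyone who wants to check whether their own environment reproduces results, here is a rough sanity check along the lines of the code in the question. It is a sketch on synthetic data, not the original dataset; SEED, fit_and_score, and the hyperparameter values are illustrative assumptions. It fits the same forest twice with the same integer random_state and compares the AUC values.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

SEED = 1234  # illustrative seed, reused for the split and the forest

# Synthetic stand-in for the real data.
X, y = make_classification(n_samples=500, n_features=10,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=SEED)


def fit_and_score(seed):
    """Fit a forest with the given seed and return the test-set AUC."""
    clf = RandomForestClassifier(n_estimators=20, max_depth=5,
                                 min_samples_leaf=3, criterion="entropy",
                                 random_state=seed)
    clf.fit(X_train, y_train)
    return roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])


auc_first = fit_and_score(SEED)
auc_second = fit_and_score(SEED)
print(auc_first, auc_second, auc_first == auc_second)
```

If the two values differ in a fresh process, the cause is more likely the environment or the data pipeline (for example an unseeded train/test split) than the classifier's seed.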
