
Different results for Random Forest Regression in R and Python

I am using the same data to run Random Forest regression in R and in Python, but I am getting very different R2 values. I understand that hyperparameters might account for some of the difference, but I don't think they explain an almost halved R2 score. I am using the following code and getting the respective results.

In Python -

    # Assumed imports (not shown in the original snippet); `data` is a pandas DataFrame
    # that has already been loaded, with the target in the 'response' column.
    from sklearn.model_selection import train_test_split
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.metrics import mean_squared_error

    def rmse(y_true, y_pred):
        # RMSE helper; the original snippet calls rmse() without defining it
        return mean_squared_error(y_true, y_pred) ** 0.5

    X = data.drop(['response'], axis=1)
    y = data['response']

    # 95/5 train/test split with a fixed seed
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=42)

    # 500 trees; oob_score=True also computes an out-of-bag R^2 (rdf.oob_score_)
    rdf = RandomForestRegressor(n_estimators=500, oob_score=True)
    rdf.fit(X_train, y_train)

    # .score() returns R^2; multiplied by 100 to report it as a percentage
    print("Random Forest Model Score (on Train)", ":", rdf.score(X_train, y_train) * 100, ",",
          "Random Forest Model Score (on Test)", ":", rdf.score(X_test, y_test) * 100)

    y_predicted = rdf.predict(X_train)
    y_test_predicted = rdf.predict(X_test)

    print("Training RMSE", ":", rmse(y_train, y_predicted),
          "Testing RMSE", ":", rmse(y_test, y_test_predicted))


>Random Forest Model Score (on Train) : 92.2312123 , Random Forest Model Score (on Test) : 78.1812321

>Training RMSE : 5.606443558164292e-06   Testing RMSE : 9.59221499904858e-06

In R -

> rows <- sample(0.95*nrow(data))
> train_random <- data[rows,]
> test_random <-  data[-rows,]

> rf_model <- randomForest(response ~ . ,
                         data = train_random,
                         keep.forest=TRUE,
                         importance=TRUE
                         )

> rf_model

Call:
 randomForest(formula = response ~ ., data = train_random, keep.forest = TRUE, importance = TRUE) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 6

          Mean of squared residuals: 1.437236e-06
                    % Var explained: 42.05
> pred_train <- predict(rf_model,train_random)
> pred_test <- predict(rf_model,test_random)
> R2_Score(pred_train, train_random$response)
[1] 0.9014311
> R2_Score(pred_test, test_random$response)
[1] 0.3616823

I understand that the train/test split does not produce identical splits in the two languages, but why am I getting such distinctly different R2 values, and how can I carry out the same Random Forest in R? I have tried using the same hyperparameters as in Python, but that does not give me the same R2 values in R. Can someone please help me?
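
For reference, two things seem to make the comparison misleading. First, the numbers are not like for like: randomForest's "% Var explained: 42.05" is an out-of-bag (OOB) estimate, while the Python "score on train" of 92% is in-sample R2 on the training data; since `oob_score=True` was set, the closer Python counterpart is `rdf.oob_score_`. Second, the defaults and the split differ: sklearn's RandomForestRegressor considers all predictors at each split for regression by default, whereas randomForest defaults to mtry = p/3, and `sample(0.95*nrow(data))` only permutes the first 95% of row indices, so the R test set is always the last 5% of rows rather than a random 5%. The sketch below is one way to bring the R run closer to the Python one; it assumes `data` is a data frame whose numeric target column is `response`, and the results will still differ somewhat because each forest is grown on random bootstrap samples.

    library(randomForest)

    set.seed(42)                                  # fix the split and the forest for repeatability

    # random 95/5 split over all rows
    # (sample(0.95*nrow(data)) only permutes the first 95% of the indices)
    n        <- nrow(data)
    train_id <- sample(n, size = floor(0.95 * n))
    train_random <- data[train_id, ]
    test_random  <- data[-train_id, ]

    p <- ncol(data) - 1                           # number of predictors

    rf_model <- randomForest(response ~ .,
                             data  = train_random,
                             ntree = 500,         # matches n_estimators = 500
                             mtry  = p,           # sklearn tries all predictors per split by default;
                                                  # randomForest's regression default is p/3
                             importance = TRUE)

    pred_test <- predict(rf_model, test_random)

    # test-set R2, comparable to rdf.score(X_test, y_test) in Python
    1 - sum((test_random$response - pred_test)^2) /
        sum((test_random$response - mean(test_random$response))^2)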

As others have commented, there is a random component to Random Forests, which you probably already knew.

In addition, Random Forest uses bootstrapping, which can change the outcome each time it is run. I have included a link for further reading; hopefully it helps guide you toward the answer you are looking for.

https://stats.stackexchange.com/questions/120446/different-results-from-several-passes-of-random-forest-on-same-dataset
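
To make that randomness concrete, here is a small R illustration (assuming a `train_random` data frame like the one in your question): fitting the identical model twice gives different out-of-bag "% Var explained" values, because each call draws fresh bootstrap samples and candidate-variable subsets, while fixing the RNG seed before each call makes the runs repeatable.

    library(randomForest)

    # two fits of the same model generally differ run to run (different bootstrap draws)
    fit1 <- randomForest(response ~ ., data = train_random, ntree = 500)
    fit2 <- randomForest(response ~ ., data = train_random, ntree = 500)
    c(fit1$rsq[500], fit2$rsq[500])          # OOB pseudo-R2 after 500 trees; usually not equal

    # fixing the RNG state before each fit makes the result repeatable
    set.seed(1); fit3 <- randomForest(response ~ ., data = train_random, ntree = 500)
    set.seed(1); fit4 <- randomForest(response ~ ., data = train_random, ntree = 500)
    identical(fit3$rsq[500], fit4$rsq[500])  # TRUE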
