Why does Random Forest regression predict exactly the same values?

Question

I am attempting to use Scikit-Learn's Random Forest regressor to predict Nominal GDP from Real GDP.

I read the data from a website and clean it up a bit, then build a dataframe containing my forecasts of Real GDP for the next three years.

I have the following code:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

gdp = pd.read_html('https://www.thebalance.com/us-gdp-by-year-3305543')[0]
gdp.columns = gdp.iloc[0]
gdp = gdp[1:]

gdp['Year'] = gdp['Year'].astype(int)

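# regex=False so that '$' is treated as a literal character, not a regex anchor
gdp['Nominal GDP (trillions)'] = gdp['Nominal GDP (trillions)'].str.replace(',', '.', regex=False).str.replace('$', '', regex=False).astype(float)
gdp['Real GDP (trillions)'] = gdp['Real GDP (trillions)'].str.replace(',', '.', regex=False).str.replace('$', '', regex=False).astype(float)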

X = pd.DataFrame(gdp['Real GDP (trillions)'].copy())
y = pd.DataFrame(gdp['Nominal GDP (trillions)'].copy())


X_pred = pd.DataFrame(data = [18.313, 18.960, 19.643], columns = ['Real GDP (trillions)'])

reg = RandomForestRegressor(n_estimators = 300)
reg.fit(X, y.values.ravel())

y_pred = reg.predict(X_pred)

It returns the following prediction:

1 | 2 | 3
---|---|---
19.72172 | 21.05464667 | 21.05464667

Why are the second and third predictions identical? It happens even if I change the X_pred values to something like [18.313, 18.960, 39.643].

Answer 1

In your training data, there is only one value greater than 18.960:

X[X.values>18.960]

    Real GDP (trillions)
91  19.092

So it is highly unlikely that any tree will end up with a threshold that separates 18.960 from 19.643 or, for that matter, 18.960 from 39.643. This is not linear regression, where predictions follow a fitted line; a tree-based model predicts a constant value within each leaf, so it cannot extrapolate beyond its training data.

We can check the thresholds for each tree:
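To illustrate the contrast with linear regression, here is a minimal sketch on made-up data (y = 2x on x = 0..10, not the GDP figures from the question). The names `X_train`, `X_new`, `forest` and `linear` are assumptions chosen for this example only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic, made-up data: y = 2x on x = 0..10 (not the GDP data)
X_train = np.arange(0, 11, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

# Two query points, both beyond the largest training value (10)
X_new = np.array([[20.0], [40.0]])

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)
linear = LinearRegression().fit(X_train, y_train)

forest_pred = forest.predict(X_new)
linear_pred = linear.predict(X_new)

print("forest:", forest_pred)  # both entries are the same
print("linear:", linear_pred)  # follows the fitted line, so the entries differ
```

Both query points lie to the right of every split the trees could have learned, so each tree routes them to the same leaf and the forest returns the same average for both.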

import numpy as np
# Unique split thresholds across all trees in the forest
thres = np.unique([j for i in reg.estimators_ for j in i.tree_.threshold])
np.sort(thres)[-10:]

array([17.80000019, 17.9375    , 18.00199986, 18.05999947, 18.20950031,
       18.26199913, 18.41149998, 18.41599941, 18.61799908, 18.88999939])

The largest threshold (about 18.89) is below both of the values you are trying to predict, so no split can separate them. They will always end up in the same leaf nodes, giving you the same prediction.
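One way to confirm this directly is to ask the forest which leaf each sample lands in. The following is a sketch on made-up data rather than the GDP series; `X_train`, `X_new`, `max_threshold` and `same_leaf_everywhere` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Made-up training data: a single feature on 0..10 with a noisy linear target
rng = np.random.default_rng(0)
X_train = np.sort(rng.uniform(0, 10, size=50)).reshape(-1, 1)
y_train = 3.0 * X_train.ravel() + rng.normal(0, 0.1, size=50)

reg = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

# Only internal nodes carry real thresholds; leaf nodes have feature == -2
max_threshold = max(
    tree.tree_.threshold[tree.tree_.feature >= 0].max()
    for tree in reg.estimators_
)

# Two query points above every threshold in the forest
X_new = np.array([[12.0], [30.0]])

# apply() returns the leaf index per tree, shape (n_samples, n_estimators)
leaves = reg.apply(X_new)
same_leaf_everywhere = bool((leaves[0] == leaves[1]).all())

pred = reg.predict(X_new)
print(max_threshold, same_leaf_everywhere, pred)
```

If `same_leaf_everywhere` is true, the two samples are indistinguishable to every tree, so their predictions must match.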
