Cross-validating on a subset with K-Fold and random forest in Python

Question

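I have a dataset with the energy consumption of neighborhoods in large Dutch cities as the dependent variable, plus several independent variables. I want to build a random forest regression model that predicts values for neighborhoods in Amsterdam only. I tried training the model on the Amsterdam neighborhoods alone, but that dataset is too small and the accuracy scores (RMSE, MAE, R2) are poor, even though the model performs well on the whole large_city dataset.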

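What I basically want is 10-fold cross-validation of the RF model. I want to split only the Amsterdam data into 10 folds, then add the rest of the large_city dataset (that is, every neighborhood outside Amsterdam) to the training set of every fold, while keeping the test folds unchanged.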

In short:

amsterdam = large_cities == 'Amsterdam'

without_amsterdam = large_cities != 'Amsterdam'

10-fold cross-validation, with 1/10 of amsterdam as test data and 9/10 of amsterdam plus all of without_amsterdam as training data in each fold.

The code I have so far:

import numpy as np
from sklearn import ensemble
from sklearn.model_selection import KFold, cross_val_score

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']

X = amsterdam.iloc[:, 4:].values
y = np.array(amsterdam.iloc[:, 3].values)

# Split the data into 10 folds.
# This 'kf' (KFold splitting strategy) object is passed
# to cross_val_score() below.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

cnt = 1
# split() generates the indices that divide the data into training and test sets.
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1


def rmse(score):
    # The scorer returns negated MSE, so flip the sign before taking the root.
    value = np.sqrt(-score)
    print(f'rmse= {value:.2f}')

score = cross_val_score(ensemble.RandomForestRegressor(random_state=42),
                        X, y, cv=kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold are: {score}')
rmse(score.mean())

The code above runs 10-fold cross-validation on the Amsterdam data only. How can I add the without_ams data to every training fold of Amsterdam?

I hope what I am trying to do makes sense.

Answer 1

You can pass the train and test indices to cross_val_score through its cv argument; see the help page. So in your case, using an example dataset:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

big_cities = pd.DataFrame(np.random.normal(0,1,(200,6)))
big_cities.insert(0,'gm_naam',
np.random.choice(['Amsterdam','Stockholm','Copenhagen'],200))

The key is to build a dataframe with the Amsterdam rows first, followed by all the others (sorting would achieve the same):

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']

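# Positions of the non-Amsterdam rows in the combined frame: they follow the
# Amsterdam rows, so they run from len(amsterdam) up to the total row count.
# Stopping at len(without_ams) would leave part of them out of training.
non_amsterdam_index = np.arange(len(amsterdam), len(amsterdam) + len(without_ams))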

combined = pd.concat([amsterdam,without_ams])
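
The sorting route mentioned above could be sketched roughly as follows. This is only an illustration, assuming pandas 1.1 or later (for the `key` argument of `sort_values`); the toy frame `demo` and the names `combined_sorted` and `n_amsterdam` are made up for the example and are not part of the original code.

```python
import pandas as pd

# Hypothetical toy frame standing in for big_cities.
demo = pd.DataFrame({
    "gm_naam": ["Stockholm", "Amsterdam", "Copenhagen", "Amsterdam", "Stockholm"],
    "value": [1.0, 2.0, 3.0, 4.0, 5.0],
})

# Sort on a boolean key: False (Amsterdam) sorts before True (everything else).
# mergesort is a stable sort, so rows keep their original order within each group.
combined_sorted = demo.sort_values(
    by="gm_naam",
    key=lambda names: names != "Amsterdam",
    kind="mergesort",
).reset_index(drop=True)

# The Amsterdam rows now occupy positions 0 .. n_amsterdam - 1.
n_amsterdam = int((combined_sorted["gm_naam"] == "Amsterdam").sum())
print(combined_sorted)
print(f"Amsterdam rows: positions 0 to {n_amsterdam - 1}")
```

Either way, what matters is that the Amsterdam rows sit at positions 0 to len(amsterdam) - 1, because the fold indices produced later are positions within the Amsterdam block.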

Now we get the CV indices using only the Amsterdam part:

X = amsterdam.iloc[:, 4:]
y = amsterdam.iloc[:, 3]

kf = KFold(n_splits=3, shuffle=True, random_state=42)

We append the non-Amsterdam indices to each training fold:

cvs = [[np.append(i,non_amsterdam_index),j] for i,j in kf.split(X, y)]

We can check the composition of each fold:

for train,test in cvs:
    print("train composition")
    print(combined.iloc[train,]["gm_naam"].value_counts())
    print("test composition")
    print(combined.iloc[test,]["gm_naam"].value_counts())

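You can see that the test folds contain only Amsterdam (the counts come from one run on random data, so yours will differ):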

train composition
Amsterdam     48
Copenhagen    33
Stockholm     21
Name: gm_naam, dtype: int64
test composition
Amsterdam    25
Name: gm_naam, dtype: int64
train composition
Amsterdam     49
Copenhagen    33
Stockholm     21
Name: gm_naam, dtype: int64
test composition
Amsterdam    24
Name: gm_naam, dtype: int64
train composition
Amsterdam     49
Copenhagen    33
Stockholm     21
Name: gm_naam, dtype: int64
test composition
Amsterdam    24
Name: gm_naam, dtype: int64
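
The same check can also be done programmatically. The sketch below wraps the split construction in a small helper and asserts the properties we care about. The function name `subset_cv_splits`, its parameters, and the sizes 73 and 127 are assumptions chosen for illustration; it relies only on the Amsterdam rows coming first in the combined frame.

```python
import numpy as np
from sklearn.model_selection import KFold


def subset_cv_splits(n_target, n_extra, n_splits=10, random_state=42):
    """Return (train, test) position pairs for a frame whose first n_target
    rows are the target subset and whose next n_extra rows are used for
    training only."""
    extra_idx = np.arange(n_target, n_target + n_extra)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    # KFold only needs the number of samples, so a dummy array is enough.
    dummy = np.zeros(n_target)
    return [(np.append(train, extra_idx), test) for train, test in kf.split(dummy)]


# Illustrative sizes: 73 target rows followed by 127 extra rows.
splits = subset_cv_splits(n_target=73, n_extra=127, n_splits=3)

for train, test in splits:
    # Test folds must point at target rows only.
    assert test.max() < 73
    # Every extra row must be in every training fold.
    assert np.isin(np.arange(73, 200), train).all()
    # Train and test must not overlap.
    assert np.intersect1d(train, test).size == 0
    print(f"train size: {len(train)}, test size: {len(test)}")
```

If any assertion fails, the rows are probably not ordered as assumed, or the extra index range does not cover the whole non-target block.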

Then cross-validate with these splits:

score = cross_val_score(RandomForestRegressor(random_state= 42),
                        X = combined.iloc[:, 4:], 
                        y = combined.iloc[:, 3], 
                        cv= cvs, scoring="neg_mean_squared_error")
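
Since the question mentions RMSE, MAE and R2, one possible way to get all three per fold is `cross_validate` with several scorers. This is a sketch on synthetic data, not the original dataset: the names `X_all`, `y_all`, `n_target` and `n_extra` are invented for the example, and the `neg_root_mean_squared_error` scorer assumes scikit-learn 0.22 or later.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in: the first n_target rows play the role of Amsterdam.
rng = np.random.default_rng(0)
n_target, n_extra = 60, 140
X_all = rng.normal(size=(n_target + n_extra, 3))
y_all = 2.0 * X_all[:, 0] + rng.normal(scale=0.5, size=n_target + n_extra)

# Folds are drawn from the target rows; extra rows join every training fold.
extra_idx = np.arange(n_target, n_target + n_extra)
kf = KFold(n_splits=3, shuffle=True, random_state=42)
splits = [(np.append(tr, extra_idx), te) for tr, te in kf.split(X_all[:n_target])]

results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X_all,
    y_all,
    cv=splits,
    scoring=("neg_root_mean_squared_error", "neg_mean_absolute_error", "r2"),
)

# The error scorers are negated so that "higher is better"; flip the sign back.
rmse_per_fold = -results["test_neg_root_mean_squared_error"]
mae_per_fold = -results["test_neg_mean_absolute_error"]
r2_per_fold = results["test_r2"]

print("RMSE per fold:", rmse_per_fold)
print("MAE per fold: ", mae_per_fold)
print("R2 per fold:  ", r2_per_fold)
```

All three metrics are computed on the test folds only, so they describe performance on the target subset even though the model is trained on everything.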

Solution 1
0 2021-05-13 13:45:45
