
Cross validation for a subset using Python K-Fold and Random Forest

I have a dataset with the energy consumption of neighbourhoods in large Dutch cities as the dependent variable, plus several independent variables. I want to build a Random Forest regression model that predicts the values for neighbourhoods in Amsterdam only. I tried training the model on Amsterdam's neighbourhoods alone, but that dataset is too small and the scores (RMSE, MAE, R2) are poor, although the model performs well on the entire large_city dataset.

What I basically want is a 10-fold cross-validation of the RF model. I want to split only the Amsterdam data into 10 folds, then add the rest of the large_city dataset (all neighbourhoods except those in Amsterdam) to the training set of every fold, while leaving the test folds unchanged.

So in short:

amsterdam = large_cities == 'Amsterdam'

without_amsterdam = large_cities != 'Amsterdam'

10-fold cross validation with 1/10 of amsterdam as test data, and 9/10 of amsterdam + all of without_amsterdam as train data per fold.

The code I made so far:

import numpy as np
from sklearn import ensemble
from sklearn.model_selection import KFold, cross_val_score

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']

X = amsterdam.iloc[:, 4:].values
y = np.array(amsterdam.iloc[:, 3].values)

# Split the data into 10 folds.
# This 'kf' (KFold splitting strategy) object is passed
# to cross_val_score() below.
kf = KFold(n_splits=10, shuffle=True, random_state=42)

cnt = 1
# split() generates the indices that divide the data into training and test sets.
for train_index, test_index in kf.split(X, y):
    print(f'Fold:{cnt}, Train set: {len(train_index)}, Test set:{len(test_index)}')
    cnt += 1


def rmse(score):
    # the scorer returns negated MSE, so flip the sign before taking the root
    value = np.sqrt(-score)
    print(f'rmse= {value:.2f}')

score = cross_val_score(ensemble.RandomForestRegressor(random_state=42),
                        X, y, cv=kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold are: {score}')
rmse(score.mean())

What the code above does is run a 10-fold cross-validation on the Amsterdam data only. How can I add the data of without_ams to every training fold of amsterdam?

I hope it makes sense what I am trying to do.

You can pass the train and test indices to cross_val_score yourself: its cv argument accepts an iterable of (train, test) index arrays, see the help page. So in your case, using an example dataset:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

big_cities = pd.DataFrame(np.random.normal(0,1,(200,6)))
big_cities.insert(0, 'gm_naam',
                  np.random.choice(['Amsterdam', 'Stockholm', 'Copenhagen'], 200))

The key is to build a dataframe with the Amsterdam rows first, followed by all the others; you can also do this by sorting:

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']

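# positions of the non-Amsterdam rows in `combined`; they follow the Amsterdam rows,
# so the range must end at len(amsterdam) + len(without_ams)
non_amsterdam_index = np.arange(len(amsterdam), len(amsterdam) + len(without_ams))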

combined = pd.concat([amsterdam,without_ams])
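
The sorting alternative mentioned above might look like the sketch below. It is only an illustration on seeded toy data: the city names other than Amsterdam are placeholders, and it assumes pandas 1.1 or newer for the `key` argument of `sort_values`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real data (seeded so the run is repeatable).
rng = np.random.default_rng(0)
big_cities = pd.DataFrame(rng.normal(0, 1, (200, 6)))
big_cities.insert(0, "gm_naam",
                  rng.choice(["Amsterdam", "Stockholm", "Copenhagen"], 200))

# Sort on a boolean key: False (Amsterdam) sorts before True (all other cities).
# mergesort is stable, so rows keep their original order within each group.
combined = big_cities.sort_values(
    by="gm_naam", key=lambda s: s != "Amsterdam", kind="mergesort"
)

# Number of Amsterdam rows, which now occupy positions 0 .. n_ams - 1.
n_ams = int((combined["gm_naam"] == "Amsterdam").sum())

# Positions (not index labels) of the non-Amsterdam rows in `combined`.
non_amsterdam_index = np.arange(n_ams, len(combined))
```

After sorting, the index labels of `combined` are shuffled, so everything downstream should address rows by position (`iloc`), as the rest of this answer does.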

Now we get the cv indices using only the Amsterdam part:

X = amsterdam.iloc[:, 4:]
y = amsterdam.iloc[:, 3]

kf = KFold(n_splits=3, shuffle=True, random_state=42)

And we append the non-Amsterdam index to each training fold:

cvs = [[np.append(i,non_amsterdam_index),j] for i,j in kf.split(X, y)]

We can check this:

for train,test in cvs:
    print("train composition")
    print(combined.iloc[train,]["gm_naam"].value_counts())
    print("test composition")
    print(combined.iloc[test,]["gm_naam"].value_counts())

You can see that each test fold contains only Amsterdam rows (the exact counts depend on the random example data):

train composition
Amsterdam     48
Copenhagen    33
Stockholm     21
Name: gm_naam, dtype: int64
test composition
Amsterdam    25
Name: gm_naam, dtype: int64
train composition
Amsterdam     49
Copenhagen    33
Stockholm     21
Name: gm_naam, dtype: int64
test composition
Amsterdam    24
Name: gm_naam, dtype: int64
train composition
Amsterdam     49
Copenhagen    33
Stockholm     21
Name: gm_naam, dtype: int64
test composition
Amsterdam    24
Name: gm_naam, dtype: int64
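
The same check can be done programmatically instead of by reading the printed counts. The sketch below works on positions alone, so it needs no dataframe; the group sizes `n_ams` and `n_other` are made up for illustration, and it assumes the Amsterdam rows come first as in `combined`.

```python
import numpy as np
from sklearn.model_selection import KFold

# Hypothetical group sizes; Amsterdam rows occupy positions 0 .. n_ams - 1.
n_ams, n_other = 73, 127
non_amsterdam_index = np.arange(n_ams, n_ams + n_other)

kf = KFold(n_splits=3, shuffle=True, random_state=42)
# KFold only needs the number of samples, so a dummy array is enough here.
cvs = [(np.append(train, non_amsterdam_index), test)
       for train, test in kf.split(np.zeros((n_ams, 1)))]

for train, test in cvs:
    # every test position belongs to an Amsterdam row
    assert test.max() < n_ams
    # every non-Amsterdam row is part of the training set
    assert np.isin(non_amsterdam_index, train).all()
    # no row appears in both train and test
    assert np.intersect1d(train, test).size == 0

# across the folds, each Amsterdam row is used for testing exactly once
all_test = np.sort(np.concatenate([test for _, test in cvs]))
assert np.array_equal(all_test, np.arange(n_ams))
```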

Then cross-validate with these splits:

score = cross_val_score(RandomForestRegressor(random_state= 42),
                        X = combined.iloc[:, 4:], 
                        y = combined.iloc[:, 3], 
                        cv= cvs, scoring="neg_mean_squared_error")
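
Since the question asks about RMSE, MAE and R2 together, one option is `cross_validate`, which accepts several scorers at once. This is a sketch on synthetic data rather than the asker's dataset: the group sizes, the features and `n_estimators=50` are arbitrary choices, and the `neg_root_mean_squared_error` scorer assumes scikit-learn 0.22 or newer.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in: the first n_ams rows play the role of Amsterdam.
rng = np.random.default_rng(0)
n_ams, n_other = 60, 140
X = rng.normal(size=(n_ams + n_other, 3))
y = X[:, 0] + rng.normal(scale=0.1, size=n_ams + n_other)

# Folds over the Amsterdam rows only; all other rows join every training set.
kf = KFold(n_splits=3, shuffle=True, random_state=42)
other_idx = np.arange(n_ams, n_ams + n_other)
cvs = [(np.append(train, other_idx), test) for train, test in kf.split(X[:n_ams])]

results = cross_validate(
    RandomForestRegressor(n_estimators=50, random_state=42),
    X, y, cv=cvs,
    scoring={"rmse": "neg_root_mean_squared_error",
             "mae": "neg_mean_absolute_error",
             "r2": "r2"},
)

# Error scorers are negated so that "higher is better"; flip the sign back.
rmse = -results["test_rmse"]
mae = -results["test_mae"]
r2 = results["test_r2"]
print(f"RMSE per fold: {rmse}, mean: {rmse.mean():.3f}")
print(f"MAE per fold:  {mae}, mean: {mae.mean():.3f}")
print(f"R2 per fold:   {r2}, mean: {r2.mean():.3f}")
```

Note that the mean of the per-fold RMSE values is generally not the same number as the square root of the mean MSE, which is what the `rmse()` helper in the question computes.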
