Cross validation for a subset using Python K-Fold and Random Forest

I have a dataset with the energy consumption of neighbourhoods in large Dutch cities as the dependent variable and several independent variables. I want to make a Random Forest regression model to predict the values for neighbourhoods in Amsterdam only. I tried training the model on Amsterdam's neighbourhoods alone, but that subset is too small and the accuracy scores (RMSE, MAE, R2) are bad, although the model performs well on the entire large_city dataset.

What I basically want to do is a 10-fold cross validation on the RF model. I want to divide only the Amsterdam data into 10 folds; then I want to add the rest of the large_city dataset (all neighbourhoods except those in Amsterdam) to the training set of every fold, while leaving the test folds unchanged.

So in short:

amsterdam = large_cities == 'Amsterdam'

without_amsterdam = large_cities != 'Amsterdam'

10-fold cross validation with 1/10 of amsterdam as test data, and 9/10 of amsterdam + all of without_amsterdam as train data per fold.

The code I made so far:

import numpy as np
from sklearn import ensemble
from sklearn.model_selection import KFold, cross_val_score

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']

X = amsterdam.iloc[:, 4:].values
y = amsterdam.iloc[:, 3].values

# Split the data into 10 folds. This 'kf' (KFold splitting strategy)
# object is passed as the cv argument to cross_val_score().
kf = KFold(n_splits=10, shuffle=True, random_state=42)

# split() generates indices that divide the data into training and test sets.
cnt = 1
for train_index, test_index in kf.split(X, y):
    print(f'Fold: {cnt}, Train set: {len(train_index)}, Test set: {len(test_index)}')
    cnt += 1


def rmse(score):
    print(f'rmse = {np.sqrt(-score):.2f}')


score = cross_val_score(ensemble.RandomForestRegressor(random_state=42),
                        X, y, cv=kf, scoring="neg_mean_squared_error")
print(f'Scores for each fold are: {score}')
rmse(score.mean())

What the code above does is a 10-fold cross-validation on the amsterdam data only. How can I add the without_ams data to every training fold of amsterdam?

I hope it makes sense what I am trying to do.

You can provide the (train, test) indices for each fold directly to cross_val_score via its cv argument; see the help page. So in your case, using an example dataset:

from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
import numpy as np

big_cities = pd.DataFrame(np.random.normal(0, 1, (200, 6)))
big_cities.insert(0, 'gm_naam',
                  np.random.choice(['Amsterdam', 'Stockholm', 'Copenhagen'], 200))

The key is to concatenate your dataframes so that the Amsterdam rows come first, followed by the others (you could also achieve this by sorting):

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']

# positional indices of the non-Amsterdam rows in the combined frame below
non_amsterdam_index = np.arange(len(amsterdam), len(amsterdam) + len(without_ams))

combined = pd.concat([amsterdam,without_ams])
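One subtlety worth checking: kf.split() returns positional indices, so this trick relies on the Amsterdam rows occupying positions 0 to len(amsterdam) - 1 of combined. A quick self-contained sanity check (a sketch using the same kind of synthetic data as above, with a seeded generator so it is reproducible):

```python
import numpy as np
import pandas as pd

# Rebuild the synthetic dataset (seeded here for reproducibility).
rng = np.random.default_rng(42)
big_cities = pd.DataFrame(rng.normal(0, 1, (200, 6)))
big_cities.insert(0, 'gm_naam',
                  rng.choice(['Amsterdam', 'Stockholm', 'Copenhagen'], 200))

amsterdam = big_cities.loc[big_cities['gm_naam'] == 'Amsterdam']
without_ams = big_cities.loc[big_cities['gm_naam'] != 'Amsterdam']
combined = pd.concat([amsterdam, without_ams])

# All Amsterdam rows must sit in the first len(amsterdam) positions,
# because the KFold indices are positional (iloc), not label-based (loc).
assert (combined.iloc[:len(amsterdam)]['gm_naam'] == 'Amsterdam').all()
assert (combined.iloc[len(amsterdam):]['gm_naam'] != 'Amsterdam').all()
print('row layout OK')
```

If the concatenation order were reversed, the fold indices would silently point at the wrong rows, so this check is cheap insurance.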

Now we generate the CV indices using only the amsterdam part:

X = amsterdam.iloc[:, 4:]
y = amsterdam.iloc[:, 3]

kf = KFold(n_splits=3, shuffle=True, random_state=42)

And we append the non-Amsterdam indices to each train fold:

cvs = [[np.append(i,non_amsterdam_index),j] for i,j in kf.split(X, y)]

We can check this:

for train,test in cvs:
    print("train composition")
    print(combined.iloc[train,]["gm_naam"].value_counts())
    print("test composition")
    print(combined.iloc[test,]["gm_naam"].value_counts())

You can see that each test fold contains only Amsterdam rows, while every train fold contains all of the Copenhagen and Stockholm rows plus the remaining nine tenths of the Amsterdam rows. The exact counts depend on the random draw, but in every fold the train composition lists all three cities (with the non-Amsterdam counts identical across folds) and the test composition lists Amsterdam only.

Then cross-validate using these folds:

score = cross_val_score(RandomForestRegressor(random_state=42),
                        X=combined.iloc[:, 4:],
                        y=combined.iloc[:, 3],
                        cv=cvs, scoring="neg_mean_squared_error")
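The returned scores are negative MSEs, so per-fold RMSE is np.sqrt(-score) as in your helper. As an alternative worth knowing (not used in the answer above), scikit-learn's PredefinedSplit supports this pattern directly: any sample whose test_fold entry is -1 is placed in every training set and never in a test set, so no row reordering is needed. A minimal sketch on synthetic data of the same shape:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import PredefinedSplit, cross_val_score

rng = np.random.default_rng(42)
big_cities = pd.DataFrame(rng.normal(0, 1, (200, 6)))
big_cities.insert(0, 'gm_naam',
                  rng.choice(['Amsterdam', 'Stockholm', 'Copenhagen'], 200))

is_ams = (big_cities['gm_naam'] == 'Amsterdam').to_numpy()

# -1 means "always in the training set, never in a test set".
test_fold = np.full(len(big_cities), -1)

# Spread the Amsterdam rows over 10 test folds in a shuffled, balanced way.
n_splits = 10
test_fold[is_ams] = rng.permutation(is_ams.sum()) % n_splits

ps = PredefinedSplit(test_fold)
score = cross_val_score(RandomForestRegressor(random_state=42),
                        big_cities.iloc[:, 4:], big_cities.iloc[:, 3],
                        cv=ps, scoring='neg_mean_squared_error')
print(f'per-fold RMSE: {np.sqrt(-score)}')
```

This keeps big_cities in its original row order, which removes the concatenation-order pitfall entirely.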
