
Stratified K-Fold For Multi-Class Object Detection?

Updated

I've uploaded a dummy data set (link here). The output of df.head():

[Image: output of df.head()]

It has 4 classes in total, and df.object.value_counts() gives:

human    23
car      13
cat       5
dog       3

I want to create proper K-Fold validation splits over a multi-class object detection data set.

Initial Approach

To achieve proper k-fold validation splits, I took the number of distinct objects and the number of bounding boxes per image into account. I understand that K-fold splitting strategies mostly depend on the data set (its meta information). For this data set, I've tried the following:

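import numpy as np
from sklearn.model_selection import StratifiedKFold

# main_df holds one row per bounding box, with 'image_id' and 'object' columns
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=101)
df_folds = main_df[['image_id']].copy()

# Per image: number of bounding boxes and number of distinct classes
df_folds.loc[:, 'bbox_count'] = 1
df_folds = df_folds.groupby('image_id').count()
df_folds.loc[:, 'object_count'] = main_df.groupby('image_id')['object'].nunique()

# Stratification key: "<distinct classes>_<bbox count bucketed in steps of 15>"
df_folds.loc[:, 'stratify_group'] = np.char.add(
    df_folds['object_count'].values.astype(str),
    df_folds['bbox_count'].apply(lambda x: f'_{x // 15}').values.astype(str)
)

# Label each image with the fold in which it is used for validation
df_folds.loc[:, 'fold'] = 0
for fold_number, (train_index, val_index) in enumerate(skf.split(X=df_folds.index, y=df_folds['stratify_group'])):
    df_folds.loc[df_folds.iloc[val_index].index, 'fold'] = fold_number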

After splitting, I checked whether it works, and it seems OK so far.

[Image]

Each fold contains stratified samples, checked with len(df_folds[df_folds['fold'] == fold_number].index), and the folds do not intersect, checked with set(A).intersection(B), where A and B are the index values (image_id) of two folds. But the bounding boxes are distributed unevenly across the folds:

Fold 0 has total: 18 + 2 + 3 = 23 bbox
Fold 1 has total: 2 + 11 = 13 bbox
Fold 2 has total: 5 + 3 = 8 bbox

Concern

However, I'm not sure whether this is the proper way to handle this type of task in general. Is the above approach OK? Does it have any issues, or is there a better approach? Any suggestions would be appreciated. Thanks.

When creating a cross-validation split, we care about creating folds which have a good distribution of the various "cases" encountered in the data.

In your case, you based your folds on the number of distinct object classes and the number of bounding boxes per image, which is a good but limited choice. If you can identify specific cases using your data or metadata, you can use them to create smarter folds.

The most obvious choice is to balance object types (classes) in your folds, but you could go further.
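One way to see how well classes are balanced is to count bounding boxes per class in each fold. The following is a minimal sketch under assumptions: the `boxes` frame and the `image_folds` assignment are made-up stand-ins for your annotations and for whatever fold column your split produced.

```python
import pandas as pd

# Hypothetical annotations: one row per bounding box.
boxes = pd.DataFrame({
    "image_id": ["a", "a", "b", "c", "c", "c", "d", "e", "f", "f"],
    "object": ["human", "car", "human", "cat", "human",
               "car", "dog", "human", "car", "human"],
})

# Hypothetical fold assignment: one fold number per image.
image_folds = pd.Series({"a": 0, "b": 1, "c": 2, "d": 0, "e": 1, "f": 2})

# Attach each box to the fold of its image, then count boxes per fold and class.
boxes["fold"] = boxes["image_id"].map(image_folds)
per_fold = pd.crosstab(boxes["fold"], boxes["object"])

print(per_fold)
print(per_fold.sum(axis=1))  # total number of boxes in each fold
```

Rows with many zeros, or row totals that differ a lot, point to folds that will be evaluated on an unrepresentative set of boxes.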

Here is the main idea: suppose some of your images show cars encountered mostly in France and others show cars encountered mostly in the US. That information could be used to create good folds with a balanced number of French and US cars in each fold. The same could be done with weather conditions, and so on. Each fold then contains representative data to learn from, so your network is less likely to be biased for your task. As a result, your model will be more robust to such real-life changes in the data.

So, can you add some metadata to your cross-validation strategy to create a better CV? If it's not the case, can you get information about potential corner cases using the x, y, w, h columns of your dataset?
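As a rough illustration of the x, y, w, h idea, box size can be turned into a per-image stratification key. This is only a sketch: the data, the `w` and `h` values, the size thresholds, and the choice of "smallest box per image" as the corner case are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical annotations; w and h are the box width and height in pixels.
boxes = pd.DataFrame({
    "image_id": [1, 1, 2, 3, 3, 4],
    "object": ["car", "human", "car", "human", "human", "cat"],
    "w": [10, 200, 300, 15, 20, 120],
    "h": [12, 150, 250, 10, 25, 100],
})
boxes["area"] = boxes["w"] * boxes["h"]

# One row per image: most frequent class and area of the smallest box.
per_image = boxes.groupby("image_id").agg(
    main_class=("object", lambda s: s.mode().iloc[0]),
    min_area=("area", "min"),
)

# Assumed thresholds (square pixels) for small, medium and large boxes.
bins = [0, 32 ** 2, 128 ** 2, float("inf")]
per_image["size_bin"] = pd.cut(
    per_image["min_area"], bins=bins, labels=["small", "medium", "large"]
)

# Combined key that could be passed as y to StratifiedKFold.split().
per_image["stratify_group"] = (
    per_image["main_class"] + "_" + per_image["size_bin"].astype(str)
)
print(per_image)
```

On a real data set each key needs at least as many images as there are folds, so rare combinations may have to be merged into a coarser group.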

Then you should try to make your folds balanced in number of samples, so that your scores are evaluated on the same sample size in every fold. This reduces variance and gives a better evaluation in the end.

You can use StratifiedKFold() or StratifiedShuffleSplit() directly to split your data set using stratified sampling based on some categorical column.

Dummy Data:

import pandas as pd
import numpy as np

np.random.seed(43)
df = pd.DataFrame({
    'ID': (1, 1, 2, 2, 3, 3),
    'Object': ('bus', 'car', 'bus', 'bus', 'bus', 'car'),
    'X': np.random.randint(0, 10, 6),
    'Y': np.random.randn(6),
})


df

Using StratifiedKFold()

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=2)

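for train_index, test_index in skf.split(df, df["Object"]):
    # split() yields positional indices, so select rows with iloc
    strat_train_set_1 = df.iloc[train_index]
    strat_test_set_1 = df.iloc[test_index]

# After the loop, the variables hold the last of the two splits
print('train_set :', strat_train_set_1, '\n', 'test_set :', strat_test_set_1)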

Similarly, if you choose to use StratifiedShuffleSplit(), you can have

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# n_splits = Number of re-shuffling & splitting iterations.

# split(X, y[, groups]) generates indices to split data into training and test set.
for train_index, test_index in sss.split(df, df["Object"]):
    # The indices are positional, so select rows with iloc
    strat_train_set = df.iloc[train_index]
    strat_test_set = df.iloc[test_index]

print('train_set :', strat_train_set, '\n', 'test_set :', strat_test_set)

I'd do this simply with the KFold class from scikit-learn:

from numpy import array
from sklearn.model_selection import KFold
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
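# shuffle and random_state must be passed as keywords in recent scikit-learn versions
kfold = KFold(n_splits=3, shuffle=True, random_state=1)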
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))

and please see if this might be helpful
