I've uploaded a dummy data set (link here); df.head() shows its first rows.
It has 4 classes in total, and df.object.value_counts() gives:
human 23
car 13
cat 5
dog 3
I want to do proper K-fold validation splits over a multi-class object detection data set. To achieve this, I took the object counts and the number of bounding boxes into account. I understand that K-fold splitting strategies mostly depend on the data set (its meta information). For now, with this data set, I've tried the following:
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=101)

# One row per image, holding the number of bounding boxes in that image
df_folds = main_df[['image_id']].copy()
df_folds.loc[:, 'bbox_count'] = 1
df_folds = df_folds.groupby('image_id').count()
# Number of distinct object classes per image
df_folds.loc[:, 'object_count'] = main_df.groupby('image_id')['object'].nunique()
# Stratification key: "<distinct classes>_<bbox count // 15>"
df_folds.loc[:, 'stratify_group'] = np.char.add(
    df_folds['object_count'].values.astype(str),
    df_folds['bbox_count'].apply(lambda x: f'_{x // 15}').values.astype(str)
)
df_folds.loc[:, 'fold'] = 0
for fold_number, (train_index, val_index) in enumerate(skf.split(X=df_folds.index, y=df_folds['stratify_group'])):
    df_folds.loc[df_folds.iloc[val_index].index, 'fold'] = fold_number
After splitting, I checked that it works, and it seems OK so far. All folds contain stratified samples (checked with len(df_folds[df_folds['fold'] == fold_number].index)) and have no intersection with each other (checked with set(A).intersection(B), where A and B are the index values (image_id) of two folds). But the issue is that the bounding box totals per fold are uneven:
Fold 0 has total: 18 + 2 + 3 = 23 bbox
Fold 1 has total: 2 + 11 = 13 bbox
Fold 2 has total: 5 + 3 = 8 bbox
However, I'm not sure whether this is the proper way to handle this type of task in general. Is the above approach OK, does it have any issues, or is there a better approach? Any suggestions would be appreciated. Thanks.
When creating a cross-validation split, we care about creating folds which have a good distribution of the various "cases" encountered in the data.
In your case, you decided to base your folds on the number of distinct object classes per image and the number of bounding boxes, which is a good but limited choice. So, if you can identify specific cases using your data/metadata, you might try to create smarter folds using it.
The most obvious choice is to balance object types (classes) in your folds, but you could go further.
Here is the main idea. Let's say you have some images with cars encountered mostly in France and others with cars encountered mostly in the US. That information could be used to create good folds with a balanced number of French and US cars in each fold. The same could be done with weather conditions, etc. Each fold will then contain representative data to learn from, so that your network won't be biased for your task. As a result, your model will be more robust to such potential real-life changes in the data.
So, can you add some metadata to your cross-validation strategy to create a better CV? If it's not the case, can you get information about potential corner cases using the x, y, w, h columns of your dataset?
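As one hedged illustration of the last point, box sizes could be turned into a categorical "case" to stratify on. The sketch below assumes hypothetical `w` and `h` columns in pixels and made-up rows; the 32² and 96² area thresholds are an assumption (similar in spirit to the COCO small/medium/large convention) and should be tuned to your own data.

```python
import pandas as pd

# Hypothetical bounding-box rows; the w/h column names follow the
# x, y, w, h columns mentioned above, the values are made up.
boxes = pd.DataFrame({
    "image_id": ["a", "a", "b", "c", "c", "d"],
    "object":   ["car", "human", "car", "cat", "human", "dog"],
    "w":        [10, 200, 50, 12, 300, 64],
    "h":        [12, 180, 40, 8, 250, 64],
})

# Box area as a crude proxy for object scale.
boxes["area"] = boxes["w"] * boxes["h"]

# Assumed thresholds: (0, 32^2] small, (32^2, 96^2] medium, above that large.
bins = [0, 32 ** 2, 96 ** 2, float("inf")]
boxes["size_bucket"] = pd.cut(
    boxes["area"], bins=bins, labels=["small", "medium", "large"]
)

# Combine class and size into one categorical key usable for stratification.
boxes["strat_key"] = boxes["object"] + "_" + boxes["size_bucket"].astype(str)
print(boxes[["image_id", "object", "area", "strat_key"]])
```

A key like `strat_key` can then be passed as the `y` argument of a stratified splitter, provided each key value occurs often enough for the chosen number of folds.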
Then you should try to have folds that are balanced in terms of samples, so that your scores are evaluated on the same sample size. This will reduce variance and provide a better evaluation in the end.
You can use StratifiedKFold() or StratifiedShuffleSplit() directly to split your data set using stratified sampling based on some categorical column.
Dummy Data:
import pandas as pd
import numpy as np
np.random.seed(43)
df = pd.DataFrame({'ID': (1,1,2,2,3,3),
                   'Object': ('bus', 'car', 'bus', 'bus', 'bus', 'car'),
                   'X' : np.random.randint(0, 10, 6),
                   'Y' : np.random.randn(6)
                   })
df
Using StratifiedKFold()
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=2)
for train_index, test_index in skf.split(df, df["Object"]):
    # Rows selected for training and for testing in this fold
    strat_train_set_1 = df.loc[train_index]
    strat_test_set_1 = df.loc[test_index]
    print('train_set :', strat_train_set_1, '\n' , 'test_set :', strat_test_set_1)
Similarly, if you choose to use StratifiedShuffleSplit(), you can have
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
# n_splits = Number of re-shuffling & splitting iterations.
for train_index, test_index in sss.split(df, df["Object"]):
    # split(X, y[, groups]) Generates indices to split data into training and test set.
    strat_train_set = df.loc[train_index]
    strat_test_set = df.loc[test_index]
print('train_set :', strat_train_set, '\n' , 'test_set :', strat_test_set)
I'd do this simply using the KFold class of scikit-learn in Python:
from numpy import array
from sklearn.model_selection import KFold
data = array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6])
# Recent scikit-learn versions require keyword arguments here
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))
and please see if this might be helpful
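To carry the plain KFold idea over to a detection data set, one option is to split the unique image ids rather than the individual box rows, so that all boxes of an image land in the same fold. The sketch below assumes a hypothetical data frame with `image_id` and `object` columns and made-up values; note that it does no class balancing at all.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

# Hypothetical data: one row per bounding box.
boxes = pd.DataFrame({
    "image_id": ["a", "a", "b", "c", "c", "c", "d", "e", "f"],
    "object":   ["human", "car", "human", "cat", "human", "car", "dog", "car", "human"],
})

# Split the unique image ids, not the box rows.
image_ids = np.asarray(boxes["image_id"].unique())
kfold = KFold(n_splits=3, shuffle=True, random_state=1)

fold_of_image = {}
for fold, (_, val_idx) in enumerate(kfold.split(image_ids)):
    for image_id in image_ids[val_idx]:
        fold_of_image[image_id] = fold

# Every box inherits the fold of its image.
boxes["fold"] = boxes["image_id"].map(fold_of_image)
print(boxes)
```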