I have a pd.DataFrame with a structure similar to the sample below:
index   x    y    z
0       x0   y0   None
1       x1   y1   None
2       x2   y2   None
3       x3   y3   None
4       x4   y4   None
5       x5   y5   None
6       x6   y6   None
My goal is to create 3 subsets of the DataFrame:
Group1 is a training set that can be used to train a model to predict z from x and y.
Group2 is a validation set used to evaluate the accuracy of the model (or of different models/parameter tunings) trained on Group1. I will fill in the correct value of z for both Group1 and Group2.
Group3 is held out until a model is chosen, and that model is then used to predict z for it.
In this case, what would be the most efficient way to do the assignment? I was thinking of simply creating subgroups within one DataFrame, as below:
index   x    y    z      group
-----   --   --   ----   ----------
0       x0   y0   None   training
1       x1   y1   None   validation
2       x2   y2   None   held out
3       x3   y3   None   held out
4       x4   y4   None   validation
5       x5   y5   None   training
6       x6   y6   None   held out
But the tips on random assignment I've seen elsewhere normally create a new DataFrame. Is that because it is more feasible?

rows = np.random.choice(df.index.values, 10)
sampled_df = df.loc[rows]

(I've written df.loc here since .ix is deprecated.)
Also, since I want to sample 3 groups instead of 2 at once, I am not sure what is the best way to sample without replacement.
You could use

df['group'] = np.random.choice(
    np.repeat(['training', 'validation', 'held out'], (2, 2, 3)),
    len(df), replace=False)

to assign a training/validation/held out label to each row. The tuple (2, 2, 3) above gives the number of rows of each type you wish to have. Since each row should get a label, the sum of the tuple should equal len(df).
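As a minimal runnable sketch of this labeling step (the toy DataFrame and the fixed seed below are assumptions added to mirror the 7-row sample in the question):

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # fixed seed so the assignment is reproducible

# Toy 7-row frame mirroring the sample in the question
df = pd.DataFrame({'x': [f'x{i}' for i in range(7)],
                   'y': [f'y{i}' for i in range(7)],
                   'z': [None] * 7})

# Build a pool of exactly len(df) labels: 2 training, 2 validation,
# 3 held out. Drawing all of them without replacement is just a
# random shuffle of that pool over the rows.
df['group'] = np.random.choice(
    np.repeat(['training', 'validation', 'held out'], (2, 2, 3)),
    len(df), replace=False)

print(df['group'].value_counts().to_dict())
```

Whatever the shuffle order, the label counts always come out as 2/2/3, because `replace=False` with a pool of size `len(df)` can only permute the pool.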
Is assigning labels better than creating sub-DataFrames?
If you assign labels, you'll end up with code like:

df['group'] = np.random.choice(
    np.repeat(['training', 'validation', 'held out'], (2, 2, 3)),
    len(df), replace=False)

goodness = dict()
params = dict()
for model in models:
    params[model] = fit(model, df.loc[df['group'] == 'training'])
    goodness[model] = validate(model, params[model], df.loc[df['group'] == 'validation'])
best_model = max(models, key=goodness.get)
result = process(best_model, params[best_model], df.loc[df['group'] == 'held out'])
If you split df (using MaxU's solution), you'll end up with code like:

train, val, held_out = np.split(df.sample(frac=1), [2, 4])
goodness = dict()
params = dict()
for model in models:
    params[model] = fit(model, train)
    goodness[model] = validate(model, params[model], val)
best_model = max(models, key=goodness.get)
result = process(best_model, params[best_model], held_out)

(The validation split is named val here so it does not shadow the validate function.)
Each time Python encounters df['group'] == 'training', the entire Series df['group'] is scanned -- an O(N) operation. df.loc[df['group'] == 'training'] then copies rows from df to form a new sub-DataFrame. Since this is done inside the loop for model in models, and twice per iteration, this O(N) operation is performed 2*len(models) times.
In contrast, if you split the DataFrame at the very beginning, then the copying is only done once. So MaxU's code is faster.
On the other hand, using the labels to create sub-DataFrames on demand saves a bit of memory, since you don't instantiate all three sub-DataFrames at once. However, unless you are really tight on memory, you'll probably prefer faster code over more memory-efficient code. If that's the case, use MaxU's solution.
Of course, you could use

df['group'] = np.random.choice(
    np.repeat(['training', 'validation', 'held out'], (2, 2, 3)),
    len(df), replace=False)
train, val, held_out = [df.loc[df['group'] == label]
                        for label in ['training', 'validation', 'held out']]

instead of

train, val, held_out = np.split(df.sample(frac=1), [2, 4])

but there is no speed or memory advantage to doing it this way either: you'd still be scanning and copying from the DataFrame three times instead of once. So again, MaxU's solution should be preferred.
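For a larger DataFrame you would typically size the splits proportionally rather than hard-coding counts like (2, 2, 3). A sketch of the same np.split idea with a 60/20/20 split (the fractions, the 100-row frame, and the fixed random_state are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': range(100), 'y': range(100)})

# Shuffle once, then cut at the 60% and 80% marks.
n = len(df)
train, val, held_out = np.split(df.sample(frac=1, random_state=0),
                                [int(0.6 * n), int(0.8 * n)])

print(len(train), len(val), len(held_out))  # 60 20 20
```

The cut points passed to np.split are cumulative row indices, so [int(0.6 * n), int(0.8 * n)] yields splits of 60%, 20%, and 20% of the rows.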