
Python: get random rows from a pandas DataFrame

I have a DataFrame df with these values:

name     algo      accuracy
tom       1         88
tommy     2         87
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

How can I randomly pick 4 records from df, with the condition that at least one record is picked for each unique value in the algo column? Here, the algo column has only 3 unique values (1, 2, 3).

Sample output 1:

name     algo      accuracy
tom       1         88
tommy     2         87
stuart    3         100
lincoln   1         88

Sample output 2:

name     algo      accuracy
mark      1         88
stuart    3         100
alex      2         99
lincoln   1         88

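One way is to sample one row per algo first, then fill the remaining slots from the rows that were not selected:

import pandas as pd

num_sample, num_algo = 4, 3

# sample one row for each algo
out = df.groupby('algo').sample(n=1)

# add the remaining samples from the rows that didn't get selected
# (assumes df has a unique index, so drop removes exactly the picked rows)
out = pd.concat([out, df.drop(out.index).sample(n=num_sample - num_algo)])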
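For reference, here is a self-contained sketch of this first approach that can be run as-is. It assumes pandas 1.1 or later (for GroupBy.sample) and a unique index; the helper name sample_with_coverage and its random_state argument are illustrative and not part of the answer above.

```python
import pandas as pd

# the example data from the question
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def sample_with_coverage(df, num_sample, col="algo", random_state=None):
    """Hypothetical helper: one random row per group, then fill up at random."""
    num_groups = df[col].nunique()
    if not num_groups <= num_sample <= len(df):
        raise ValueError("num_sample must lie between the group count and len(df)")
    # one random row for each unique value of `col`
    first = df.groupby(col).sample(n=1, random_state=random_state)
    # the remaining rows are drawn from everything not picked yet
    # (relies on a unique index so that drop removes only the picked rows)
    rest = df.drop(first.index).sample(
        n=num_sample - num_groups, random_state=random_state
    )
    return pd.concat([first, rest])


out = sample_with_coverage(df, num_sample=4, random_state=0)
print(out)
```

The result always has num_sample rows and covers every algo value, but the extra rows can all come from the same algo, so the counts are not necessarily balanced.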

Another way is to shuffle the data, enumerate the rows within each algo, sort by that enumeration, and take the first num_sample rows. This takes slightly more code than the first approach, but it is cheaper and produces more balanced algo counts:

# shuffle the algo column (the index still identifies the original rows)
df_random = df['algo'].sample(frac=1)

# enumerate the rows with the same algo: 0, 1, 2, ... in shuffled order
enums = df_random.groupby(df_random).cumcount()

# sort by the enumeration, so the first row of every algo comes first
enums = enums.sort_values()

# pick the first num_sample entries;
# their index labels are the labels of the sampled rows,
# so we can use `loc`
out = df.loc[enums.iloc[:num_sample].index]
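The shuffle-and-enumerate steps can also be put into one function, as in the following runnable sketch. The function name balanced_sample and the random_state argument are assumptions made for illustration. It also uses a stable sort (kind="mergesort"), which is a small deviation from the snippet above, so that ties in the enumeration keep their shuffled order.

```python
import pandas as pd

# the example data from the question
df = pd.DataFrame({
    "name": ["tom", "tommy", "mark", "stuart", "alex", "lincoln"],
    "algo": [1, 2, 1, 3, 2, 1],
    "accuracy": [88, 87, 88, 100, 99, 88],
})


def balanced_sample(df, num_sample, col="algo", random_state=None):
    """Hypothetical helper: shuffle, rank rows within each group, take the lowest ranks."""
    # shuffle only the grouping column; the index still identifies the rows
    shuffled = df[col].sample(frac=1, random_state=random_state)
    # 0 for the first shuffled row of each group, 1 for the second, and so on
    rank = shuffled.groupby(shuffled).cumcount()
    # a stable sort keeps the shuffled order among rows with equal rank
    picked = rank.sort_values(kind="mergesort").index[:num_sample]
    # assumes df has a unique index, so each label selects exactly one row
    return df.loc[picked]


out = balanced_sample(df, num_sample=4, random_state=0)
print(out)
```

Because every row with rank 0 is taken before any row with rank 1, each algo is covered whenever num_sample is at least the number of unique algo values, and no algo gets a second row before all of them have a first one.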
