简体   繁体   中英

Randomly selecting a subset of rows from a pandas dataframe based on existing column values

I have a dataframe that contains multiple columns and would like to randomly select an equal number of rows based on the values of specific column. I thought of using a df.groupby['...'] but it did not work. Here is an example:

数据框

Assume that I would like to randomly select one row per GroupID, how can I accomplish this? For example, let's say I select one random row per GroupID, the result would yield the following:

数据帧随机选择

such that it outputs a single row based on the values in GroupID. Assume, for example, that rows are sorted by GroupID (from asc to desc), then select an "n" number of rows from ones that pertain to GroupID 1, 2, 3, and so on. Any information would definitely be helpful.

Also, if I need to select a specific number of rows per GroupID (let's say 1 row for GroupID=100, 4 rows for GroupID=200, etc.), any ideas?

[update] I made a minor revision or extension using the recommended answer below to selectively choose specific n values for each group using the following:

samples = []
values = [1,  1,  2, ...]
index = 0
for group in df.GroupID.unique():
    s = df.loc[df.GroupID== group].sample(n=values[index ]).reset_index(drop=True)
    samples.append(s)
    index = index + 1
    
sample = pd.concat(samples, axis=0)

I hope this code snippet will work for you

samples = []
for group in df.GroupID.unique():
    s = df.loc[df.GroupID== group].sample(n=1).reset_index(drop=True)
    samples.append(s)
    
sample = pd.concat(samples, axis=0)

The code will take each 'GroupID' and sample observation from that subgroup. You can concatenate the subsamples (with one GroupID) for the needed sample.

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM