
sampling dataframe based on quantile (pandas)

I have a data frame that I want to sample based on an argument num_samples. I want to sample uniformly across the quantiles of Age.

For example, if my dataframe has 1000 rows and num_samples = 0.5, I would need to sample 500 rows in total, 125 from each of the four quantile bins (quartiles).

The first few records of my dataframe look like this:

Age  x1 x2 x3
12   1  1  2
45   2  1  3
67   4  1  2
11   3  4  10
18   9  7  6
45   3  5  8
78   8  4  7
64   6  2  3
33   3  2  2

How can I do this in python/pandas?

Create a column quantile that holds the quantile bin of Age for each row. Then use boolean masking and sample to draw rows from each bin, and use pd.concat to combine the samples obtained for each bin.

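# Label each row with the Age quartile it falls into
labels = ['q1', 'q2', 'q3', 'q4']
df['quantile'] = pd.qcut(df.Age, q=4, labels=labels)

# Draw one row from each quartile and stack the results
out = pd.concat([df[df['quantile'].eq(label)].sample(1) for label in labels])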

Prints:

>>> out
   Age  x1  x2  x3 quantile
4   18   9   7   6       q1
8   33   3   2   2       q2
7   64   6   2   3       q3
2   67   4   1   2       q4

PS: To draw n rows from each bin, change sample(1) to sample(n).
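To connect this back to the num_samples fraction in the question, here is a minimal sketch on synthetic data. The frame, the random seed, and the helper names per_bin and n_bins are assumptions made for illustration, not part of the original answer. It converts the overall fraction into a fixed row count per quartile and assumes every bin holds at least that many rows.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's frame (assumed data, 1000 rows)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(1, 90, size=1000),
    "x1": rng.integers(1, 10, size=1000),
})

num_samples = 0.5   # overall fraction of rows to keep
n_bins = 4          # quartiles
per_bin = int(len(df) * num_samples / n_bins)  # 125 rows per quartile here

labels = ["q1", "q2", "q3", "q4"]
quartile = pd.qcut(df["Age"], q=n_bins, labels=labels)

# Draw the same number of rows from every quartile, then stack them
out = pd.concat(
    [df[quartile.eq(label)].sample(n=per_bin, random_state=0) for label in labels]
)

# Rows drawn per quartile
print(quartile.loc[out.index].value_counts())
```

Because each bin contributes exactly per_bin rows, the result is balanced across quartiles even when ties in Age make the bins slightly unequal in size.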

From Pandas 1.1.0, there's groupby().sample so you can do something like this:

df.groupby(pd.qcut(df.Age, q=4, duplicates='drop')).sample(frac=0.5)
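The groupby approach can be tried on synthetic data as sketched below. The frame and the seed are assumptions for illustration. Note that pd.qcut needs an explicit number of bins, passed here as q=4, and that observed=True is added so only bins that actually occur are grouped.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the asker's frame (assumed data, 1000 rows)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Age": rng.integers(1, 90, size=1000),
    "x1": rng.integers(1, 10, size=1000),
})

# Quartile bin for every row; duplicates='drop' merges repeated bin edges
bins = pd.qcut(df["Age"], q=4, duplicates="drop")

# Sample half of the rows inside each quartile (requires pandas >= 1.1.0)
sampled = df.groupby(bins, observed=True).sample(frac=0.5, random_state=0)

# Compare the size of each bin with the number of rows drawn from it
sizes = df.groupby(bins, observed=True).size()
counts = sampled.groupby(bins.loc[sampled.index], observed=True).size()
print(pd.DataFrame({"bin_size": sizes, "sampled": counts}))
```

With frac, each bin contributes in proportion to its own size, so the per-bin counts are only approximately equal when ties in Age make the bins uneven. If exactly equal counts are required, sample(n=...) can be used in place of frac.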
