I have a data frame that I want to sample based on an argument num_samples. I want to sample uniformly across the quantiles of Age.
For example, if my dataframe has 1000 rows and num_samples = 0.5,
I would need to sample 500 rows, 125 from each quartile.
The first few records of my dataframe look like this:
Age  x1  x2  x3
 12   1   1   2
 45   2   1   3
 67   4   1   2
 11   3   4  10
 18   9   7   6
 45   3   5   8
 78   8   4   7
 64   6   2   3
 33   3   2   2
How can I do this in python/pandas?
Create a column quantile that holds the quantile bin for each Age value. Then use boolean masking and sample to draw rows from each bin, and use pd.concat to concatenate the samples obtained for each bin.
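import pandas as pd
labels = ['q1', 'q2', 'q3', 'q4']
# Assign each row to an Age quartile
df['quantile'] = pd.qcut(df.Age, q=4, labels=labels)
# Draw one row from each quartile and stack the results
out = pd.concat([df[df['quantile'].eq(label)].sample(1) for label in labels])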
Prints:
>>> out
   Age  x1  x2  x3 quantile
4   18   9   7   6       q1
8   33   3   2   2       q2
7   64   6   2   3       q3
2   67   4   1   2       q4
PS: To sample n rows from each bin, change sample(1) to sample(n).
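To tie this back to the num_samples fraction in the question, here is a minimal sketch of the same masking approach wrapped in a function. The function name sample_by_quartile, its parameters, and the demo data are all made up for illustration; it assumes the per-bin count should be the total sample size divided evenly by the number of bins, and it caps the count at the bin size so small bins do not raise an error.

```python
import numpy as np
import pandas as pd


def sample_by_quartile(df, num_samples, column="Age", q=4, seed=None):
    """Draw about num_samples * len(df) rows, split evenly across q quantile bins.

    Hypothetical helper, not part of pandas.
    """
    per_bin = int(len(df) * num_samples) // q  # e.g. int(1000 * 0.5) // 4 == 125
    # Integer bin codes 0..q-1; duplicates="drop" may yield fewer than q bins
    bins = pd.qcut(df[column], q=q, labels=False, duplicates="drop")
    parts = []
    for code in sorted(bins.dropna().unique()):
        group = df[bins.eq(code)]
        # Cap at the bin size so sample() does not raise on small bins
        parts.append(group.sample(n=min(per_bin, len(group)), random_state=seed))
    return pd.concat(parts)


# Hypothetical data: 1000 rows with distinct random ages
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Age": rng.uniform(1, 90, size=1000),
    "x1": rng.integers(0, 10, size=1000),
})
picked = sample_by_quartile(demo, num_samples=0.5, seed=0)
print(len(picked))
```

If Age has many tied values, the bins may not be equal in size, so the total can come out slightly below the requested number when a bin holds fewer rows than per_bin.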
From pandas 1.1.0, there's groupby().sample, so you can do something like this:
df.groupby(pd.qcut(df.Age, q=4, duplicates='drop')).sample(frac=0.5)
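Note that frac=0.5 takes half of each bin, which gives equal counts only when the bins are the same size. If you need exactly the same number of rows per quartile, as in the question, one option is to pass n instead of frac. This is a rough sketch assuming pandas 1.1.0 or later; the demo frame and the per_bin variable are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 1000 rows with distinct random ages
rng = np.random.default_rng(1)
df = pd.DataFrame({"Age": rng.uniform(1, 90, size=1000)})

num_samples = 0.5
per_bin = int(len(df) * num_samples) // 4  # 125 rows per quartile

quartile = pd.qcut(df["Age"], q=4, duplicates="drop")
# n= fixes the count per group; sample() raises ValueError if a group
# has fewer than n rows (unless replace=True is passed)
out = df.groupby(quartile, observed=True).sample(n=per_bin, random_state=0)
print(out.shape)
```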