Pandas stratified sampling by count
I want to create a sample column that stratifies the rows by sId and cId according to vcount.
df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1', 4: 's2', 5: 's2', 6: 's2', 7: 's2', 8: 's3', 9: 's3', 10: 's3', 11: 's3', 12: 's3'}, 'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3', 4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7', 8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11', 12: 'c12'}, 'vcount': {0: 322, 1: 168, 2: 1818, 3: 81, 4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979, 9: 1561, 10: 1548, 11: 1009, 12: 11}})
sId cId vcount
0 s0 c0 322
1 s0 c1 168
2 s1 c2 1818
3 s1 c3 81
4 s2 c4 13114
5 s2 c5 5
6 s2 c6 3
7 s2 c7 2
8 s3 c8 1979
9 s3 c9 1561
10 s3 c10 1548
11 s3 c11 1009
12 s3 c12 11
Now I need it to produce a sample of 100. Expected output:
sId cId vcount sample
0 s0 c0 322 50
1 s0 c1 168 50
2 s1 c2 1818 50
3 s1 c3 81 50
4 s2 c4 13114 90
5 s2 c5 5 5
6 s2 c6 3 3
7 s2 c7 2 2
8 s3 c8 1979 22
9 s3 c9 1561 22
10 s3 c10 1548 22
11 s3 c11 1009 23
12 s3 c12 11 11
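As you can see, sId s2 has 4 cIds, so we would want 25 from each cId; but only one of them has more than 25, so we have to select everything from the other cIds and take the remainder from c4. Similarly, s0 has 2 cIds, so we need 50 from each, and each cId has more than 50 samples. For s3 it does not matter which cId gets the largest sample; I only need the distribution to be as even as possible.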
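To make the arithmetic for s2 concrete, here is a small sketch in plain Python. The names `target`, `even_share`, `small`, and `large` are only illustrative and do not come from the question.

```python
# Worked arithmetic for group s2 from the question (target of 100 samples).
target = 100
vcounts = {"c4": 13114, "c5": 5, "c6": 3, "c7": 2}

# 25.0 per cId if every cId had enough rows.
even_share = target / len(vcounts)

# cIds that cannot supply an even share contribute everything they have.
small = {c: v for c, v in vcounts.items() if v < even_share}
remaining = target - sum(small.values())  # 100 - (5 + 3 + 2)

# The remainder goes to the cIds that do have enough rows (only c4 here).
large = [c for c in vcounts if c not in small]
sample = {**small, **{c: remaining // len(large) for c in large}}
print(sample)
```

With a single large cId the remainder needs no further splitting; s3 is the case where the remainder has to be divided among several cIds.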
The goal is to cover every cId of each sId and to divide the 100 as evenly as possible.
I could not figure this out and typed the sample column in by hand; however, that is not a reasonable solution once the list gets large.
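Try something like:
1. near_split splits an integer into bins that are as equal as possible.
2. Group by sId and apply get_sample to each group.
3. Prime the sample column with the values of vcount.
4. Build a mask of the rows where vcount is less than total_sample / rows in the group.
5. Get the Series of vcount values that are less than that minimum sample.
6. Set sample on the negated mask (where vcount is greater than or equal to the minimum sample) to distribute the remaining samples evenly.
import pandas as pd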
df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1',
4: 's2', 5: 's2', 6: 's2', 7: 's2',
8: 's3', 9: 's3', 10: 's3', 11: 's3',
12: 's3'},
'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3',
4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7',
8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11',
12: 'c12'},
'vcount': {0: 322, 1: 168, 2: 1818, 3: 81,
4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979,
9: 1561, 10: 1548, 11: 1009,
12: 11}})
# Control Variables
total_sample = 100
def near_split(x, num_bins):
    # Split integer x into num_bins integers that differ by at most 1
    if num_bins <= 0:
        return
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


def get_sample(g):
    # How Many Values In Group
    rows = len(g)
    # Prime sample with values of vcount
    g['sample'] = g['vcount']
    # Get locations Where vcount is less than number of samples
    lt_mask = g['vcount'] < (total_sample / rows)
    # Get Series of vcount that match lt_mask
    lt_s = g.loc[lt_mask, 'vcount']
    # Sum lt_s and subtract from total_sample to get remaining
    # Distribute remaining evenly among GTE rows
    # Set ~lt_mask sample to the calculated distribution
    g.loc[~lt_mask, 'sample'] = \
        near_split(total_sample - lt_s.sum(), rows - len(lt_s))
    return g


# group_keys=False keeps the original row index on newer pandas versions
new_df = df.groupby('sId', group_keys=False).apply(get_sample)
# For Display
print(new_df)
Output:
sId cId vcount sample
0 s0 c0 322 50
1 s0 c1 168 50
2 s1 c2 1818 50
3 s1 c3 81 50
4 s2 c4 13114 90
5 s2 c5 5 5
6 s2 c6 3 3
7 s2 c7 2 2
8 s3 c8 1979 23
9 s3 c9 1561 22
10 s3 c10 1548 22
11 s3 c11 1009 22
12 s3 c12 11 11
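One caveat: get_sample compares vcount against the even share only once. For a group such as vcount = [100, 30, 5, 5], the remaining 90 would be split as 45 and 45, although the second row only has 30. The following is a minimal sketch of an iterative variant, assuming that recomputing the share after each round is acceptable. `allocate` and the `demo` frame are hypothetical names for illustration and are not part of the answer above. It iterates over the groups directly instead of using groupby.apply.

```python
import pandas as pd


def near_split(x, num_bins):
    """Split integer x into num_bins integers that differ by at most 1."""
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


def allocate(vcounts, total):
    """Hypothetical helper: spread `total` over rows, capped by each vcount."""
    sample = pd.Series(0, index=vcounts.index, dtype="int64")
    open_rows = list(vcounts.index)  # rows whose sample is not fixed yet
    remaining = min(total, int(vcounts.sum()))
    while open_rows:
        share = remaining / len(open_rows)
        small = [i for i in open_rows if vcounts.loc[i] < share]
        if not small:
            # Every open row can supply its share: split what is left evenly.
            sample.loc[open_rows] = near_split(remaining, len(open_rows))
            break
        # Rows that fall short give everything they have; the share is then
        # recomputed for the rows that are still open.
        for i in small:
            sample.loc[i] = vcounts.loc[i]
        remaining -= int(vcounts.loc[small].sum())
        open_rows = [i for i in open_rows if i not in small]
    return sample


# Illustrative data: group "a" is the problematic case described above.
demo = pd.DataFrame({"sId": ["a"] * 4 + ["b"] * 2,
                     "vcount": [100, 30, 5, 5, 322, 168]})
demo["sample"] = pd.concat(
    [allocate(g, 100) for _, g in demo.groupby("sId")["vcount"]]
)
print(demo)
```

On the data from the question this variant should give the same sample column as the output above, because no group there needs a second round.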