
Pandas stratified sampling by count

I want to create a sample column derived from sId, cId, and vcount:

df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1', 4: 's2', 5: 's2', 6: 's2', 7: 's2', 8: 's3', 9: 's3', 10: 's3', 11: 's3', 12: 's3'}, 'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3', 4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7', 8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11', 12: 'c12'}, 'vcount': {0: 322, 1: 168, 2: 1818, 3: 81, 4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979, 9: 1561, 10: 1548, 11: 1009, 12: 11}})

      sId      cId     vcount
0      s0       c0     322
1      s0       c1     168
2      s1       c2    1818
3      s1       c3      81
4      s2       c4   13114
5      s2       c5       5
6      s2       c6       3
7      s2       c7       2
8      s3       c8    1979
9      s3       c9    1561
10     s3      c10    1548
11     s3      c11    1009
12     s3      c12      11

Now I need this for a sample size of 100. Expected output:

      sId      cId  vcount  sample
0      s0       c0     322      50
1      s0       c1     168      50
2      s1       c2    1818      50
3      s1       c3      81      50
4      s2       c4   13114      90
5      s2       c5       5       5
6      s2       c6       3       3
7      s2       c7       2       2
8      s3       c8    1979      22
9      s3       c9    1561      22
10     s3      c10    1548      22
11     s3      c11    1009      23
12     s3      c12      11      11

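As you can see, sId s2 has 4 cIds, so ideally we would take 25 from each cId. However, only one of them (c4) has more than 25, so we have to select everything from the other cIds and take the remainder from c4. Similarly, s0 has 2 cIds, so we need 50 from each, and each cId has more than 50 samples. For s3, it does not matter which cId gets the largest sample; I just need the distribution to be as even as possible.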
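To make the s2 case concrete, the arithmetic can be checked in a few lines of plain Python. This is only an illustration of the reasoning above, using the s2 values from the question; the names even_share, small, large, and remaining are made up for this sketch.

```python
# Worked arithmetic for the s2 group (values copied from the question).
vcount = {'c4': 13114, 'c5': 5, 'c6': 3, 'c7': 2}
total_sample = 100

# Ideal even share per cId: 100 / 4 = 25.0
even_share = total_sample / len(vcount)

# cIds that cannot supply an even share contribute everything they have
small = {c: v for c, v in vcount.items() if v < even_share}

# Whatever is still missing has to come from the remaining cId(s)
large = [c for c in vcount if c not in small]
remaining = total_sample - sum(small.values())

print(small)      # {'c5': 5, 'c6': 3, 'c7': 2}
print(large)      # ['c4']
print(remaining)  # 90
```

So c5, c6, and c7 are taken in full (10 in total) and c4 supplies the other 90, which matches the expected output.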

The goal is to take all the cIds of each sId and divide 100 among them as evenly as possible.

I could not figure this out and entered the sample column by hand; however, that is not a reasonable solution once the list gets bigger.

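Try something like:

  1. Take near_split (from another SO question) to split an integer into near-equal bins.
  2. Groupby sId and apply get_sample.
  3. Prime the sample column with the values from vcount.
  4. Create a mask of the rows where vcount is less than total_sample / rows in the group.
  5. Get the Series of vcount values that are less than that minimum sample.
  6. Assign to sample on the negated mask (where vcount is greater than or equal to the minimum sample) the remaining samples, distributed evenly.

import pandas as pd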

df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1',
                           4: 's2', 5: 's2', 6: 's2', 7: 's2',
                           8: 's3', 9: 's3', 10: 's3', 11: 's3',
                           12: 's3'},
                   'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3',
                           4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7',
                           8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11',
                           12: 'c12'},
                   'vcount': {0: 322, 1: 168, 2: 1818, 3: 81,
                              4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979,
                              9: 1561, 10: 1548, 11: 1009,
                              12: 11}})

# Control Variables
total_sample = 100


def near_split(x, num_bins):
    if num_bins <= 0:
        return
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


def get_sample(g):
    # Work on a copy so the group passed in by groupby is not mutated
    g = g.copy()
    # How many rows (cIds) are in this group
    rows = len(g)
    # Prime sample with values of vcount
    g['sample'] = g['vcount']
    # Get locations Where vcount is less than number of samples
    lt_mask = g['vcount'] < (total_sample / rows)
    # Get Series of vcount that match lt_mask
    lt_s = g.loc[lt_mask, 'vcount']
    # Sum lt_s and subtract from total_sample to get remaining
    # Distribute remaining evenly among GTE rows
    # Set ~lt_mask sample to the calculated distribution
    g.loc[~lt_mask, 'sample'] = \
        near_split(total_sample - lt_s.sum(), rows - len(lt_s))
    return g


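# group_keys=False keeps the original flat index; pandas >= 2.0 would
# otherwise prepend sId as an extra index level
new_df = df.groupby('sId', group_keys=False).apply(get_sample)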

# For Display
print(new_df)

Output:

   sId  cId  vcount  sample
0   s0   c0     322      50
1   s0   c1     168      50
2   s1   c2    1818      50
3   s1   c3      81      50
4   s2   c4   13114      90
5   s2   c5       5       5
6   s2   c6       3       3
7   s2   c7       2       2
8   s3   c8    1979      23
9   s3   c9    1561      22
10  s3  c10    1548      22
11  s3  c11    1009      22
12  s3  c12      11      11
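One caveat worth noting: get_sample caps only the rows that fall below the initial even share. If redistributing the leftover pushes another row above its own vcount, that row can be assigned more than it actually has. The following is a minimal sketch of an iterative variant, not part of the original answer; the helper name allocate and the column name sample_iter are hypothetical. It assumes the same near_split helper and uses groupby(...)['vcount'].transform so the original index is preserved.

```python
import pandas as pd


def near_split(x, num_bins):
    # Split integer x into num_bins integers that differ by at most 1
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


def allocate(vcount, total=100):
    # Hypothetical helper: repeatedly cap rows whose vcount is below the
    # current even share, then re-split what is left among the other rows.
    sample = pd.Series(0, index=vcount.index, dtype='int64')
    open_rows = list(vcount.index)  # rows whose allocation is not fixed yet
    remaining = total
    while open_rows:
        share = remaining / len(open_rows)
        capped = [i for i in open_rows if vcount.at[i] < share]
        if not capped:
            # Every open row can absorb an even share: split and stop
            sample.loc[open_rows] = near_split(remaining, len(open_rows))
            break
        for i in capped:
            # Take everything this row has and shrink the remaining budget
            sample.at[i] = int(vcount.at[i])
            remaining -= int(vcount.at[i])
        open_rows = [i for i in open_rows if i not in capped]
    return sample


df = pd.DataFrame({
    'sId': ['s0', 's0', 's1', 's1', 's2', 's2', 's2', 's2',
            's3', 's3', 's3', 's3', 's3'],
    'cId': [f'c{i}' for i in range(13)],
    'vcount': [322, 168, 1818, 81, 13114, 5, 3, 2,
               1979, 1561, 1548, 1009, 11],
})

# Same result as get_sample on the question's data
df['sample_iter'] = df.groupby('sId')['vcount'].transform(allocate)
print(df)

# A case that needs more than one pass: 40 is above the first even share
# (33.3) but below the share left after capping the row with 5 (47.5)
print(allocate(pd.Series([1000, 40, 5])).tolist())  # [55, 40, 5]
```

On the data from the question this gives the same sample values as the output above. If a group's vcount values sum to less than the target, every row is simply taken in full and the group total stays below 100.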

