Pandas stratified sampling by count

Question

I want to create a sample column that is computed per sId and cId from vcount.

df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1', 4: 's2', 5: 's2', 6: 's2', 7: 's2', 8: 's3', 9: 's3', 10: 's3', 11: 's3', 12: 's3'}, 'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3', 4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7', 8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11', 12: 'c12'}, 'vcount': {0: 322, 1: 168, 2: 1818, 3: 81, 4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979, 9: 1561, 10: 1548, 11: 1009, 12: 11}})

      sId      cId     vcount
0      s0       c0     322
1      s0       c1     168
2      s1       c2    1818
3      s1       c3      81
4      s2       c4   13114
5      s2       c5       5
6      s2       c6       3
7      s2       c7       2
8      s3       c8    1979
9      s3       c9    1561
10     s3      c10    1548
11     s3      c11    1009
12     s3      c12      11

Now I need it to work for a sample of 100 per sId. Expected output:

      sId      cId  vcount  sample
0      s0       c0     322      50
1      s0       c1     168      50
2      s1       c2    1818      50
3      s1       c3      81      50
4      s2       c4   13114      90
5      s2       c5       5       5
6      s2       c6       3       3
7      s2       c7       2       2
8      s3       c8    1979      22
9      s3       c9    1561      22
10     s3      c10    1548      22
11     s3      c11    1009      23
12     s3      c12      11      11

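As you can see, sId s2 has 4 cIds, so we would want 25 from each cId; but only one of them (c4) has more than 25, so we have to select everything from the other cIds and take the remainder from c4. Similarly, s0 has 2 cIds, so we need 50 from each cId, and each cId has more than 50 samples. For s3, it does not matter which cId gets the largest sample; I just need the distribution to be as even as possible.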
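To make the s2 arithmetic concrete, here is a minimal sketch in plain Python. The values are copied from the table above; the variable names (`fair_share`, `small`, `remaining`) are illustrative only and not part of any library.

```python
# vcount values for sId s2, copied from the table above
vcounts = {"c4": 13114, "c5": 5, "c6": 3, "c7": 2}
total = 100

# Even share per cId if every cId had enough rows: 100 / 4
fair_share = total / len(vcounts)

# cIds that cannot supply a full share contribute everything they have
small = {c: v for c, v in vcounts.items() if v < fair_share}

# Whatever is still missing has to come from the remaining cId (c4)
remaining = total - sum(small.values())

print(fair_share)  # 25.0
print(small)       # {'c5': 5, 'c6': 3, 'c7': 2}
print(remaining)   # 90
```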

The goal is to cover all cIds of each sId and divide the 100 among them as evenly as possible.

I could not figure this out and entered the sample column by hand; however, that is not a reasonable solution once the list gets large.

Answer 1

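Try something like:

Take near_split from the SO question on splitting an integer into near-equal bins.
Groupby sId and apply get_sample.
Prime the sample column with the values from vcount.
Create a mask of the vcount values that are less than total_sample / rows in the group.
Get the Series of vcount values that are less than that minimum sample.
Assign to sample, on the negated mask (where vcount is GTE the minimum sample), an even distribution of the remaining samples.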

import pandas as pd

df = pd.DataFrame({'sId': {0: 's0', 1: 's0', 2: 's1', 3: 's1',
                           4: 's2', 5: 's2', 6: 's2', 7: 's2',
                           8: 's3', 9: 's3', 10: 's3', 11: 's3',
                           12: 's3'},
                   'cId': {0: 'c0', 1: 'c1', 2: 'c2', 3: 'c3',
                           4: 'c4', 5: 'c5', 6: 'c6', 7: 'c7',
                           8: 'c8', 9: 'c9', 10: 'c10', 11: 'c11',
                           12: 'c12'},
                   'vcount': {0: 322, 1: 168, 2: 1818, 3: 81,
                              4: 13114, 5: 5, 6: 3, 7: 2, 8: 1979,
                              9: 1561, 10: 1548, 11: 1009,
                              12: 11}})

# Control Variables
total_sample = 100


def near_split(x, num_bins):
    if num_bins <= 0:
        return
    quotient, remainder = divmod(x, num_bins)
    return [quotient + 1] * remainder + [quotient] * (num_bins - remainder)


def get_sample(g):
    # Work on a copy so the group passed in by groupby is not modified in place
    g = g.copy()
    # How many rows (cIds) are in this group
    rows = len(g)
    # Prime sample with the values of vcount
    g['sample'] = g['vcount']
    # Rows whose vcount is below the even share (total_sample / rows)
    lt_mask = g['vcount'] < (total_sample / rows)
    # Series of vcount values that match lt_mask
    lt_s = g.loc[lt_mask, 'vcount']
    # Subtract the small rows from total_sample and distribute the
    # remainder near-evenly among the GTE rows (~lt_mask).
    # Skip this when every row is small: near_split would return None.
    if (~lt_mask).any():
        g.loc[~lt_mask, 'sample'] = \
            near_split(total_sample - lt_s.sum(), rows - len(lt_s))
    return g


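# group_keys=False keeps the original flat index; since pandas 2.0 the
# default would add sId as an extra index level to the result
new_df = df.groupby('sId', group_keys=False).apply(get_sample)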

# For Display
print(new_df)

Output:

   sId  cId  vcount  sample
0   s0   c0     322      50
1   s0   c1     168      50
2   s1   c2    1818      50
3   s1   c3      81      50
4   s2   c4   13114      90
5   s2   c5       5       5
6   s2   c6       3       3
7   s2   c7       2       2
8   s3   c8    1979      23
9   s3   c9    1561      22
10  s3  c10    1548      22
11  s3  c11    1009      22
12  s3  c12      11      11
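
One caveat with the single pass above: once the small rows are set aside, the share for the remaining rows grows, and it can exceed the vcount of a row that passed the first check. For example, vcount values 1000, 40 and 2 with a total of 100 would give 49 to the row that only holds 40. The sample data does not hit this case. Below is a minimal sketch of an iterative variant in plain Python, assuming integer counts; the helper name `allocate` is hypothetical and is not part of pandas or of the answer above.

```python
def allocate(vcounts, total=100):
    """Split `total` across rows as evenly as possible without
    giving any row more than its own vcount (integers assumed)."""
    sample = [0] * len(vcounts)
    open_rows = list(range(len(vcounts)))  # rows not yet assigned
    remaining = total
    while open_rows:
        share = remaining / len(open_rows)
        # Rows too small for the current even share give everything they have
        small = [i for i in open_rows if vcounts[i] < share]
        if not small:
            # Every open row can cover its share: split the rest near-evenly
            quotient, remainder = divmod(remaining, len(open_rows))
            for pos, i in enumerate(open_rows):
                sample[i] = quotient + (1 if pos < remainder else 0)
            break
        for i in small:
            sample[i] = vcounts[i]
            remaining -= vcounts[i]
        open_rows = [i for i in open_rows if i not in small]
    return sample


print(allocate([13114, 5, 3, 2]))             # [90, 5, 3, 2]
print(allocate([1979, 1561, 1548, 1009, 11]))  # [23, 22, 22, 22, 11]
print(allocate([1000, 40, 2]))                 # [58, 40, 2]
```

The function takes the vcount values of a single sId group and returns the samples in the same order. When a group holds fewer than `total` rows overall, it simply returns every vcount unchanged.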

Answer 1
2, accepted, 2021-05-01 03:59:05
