How to undersample so that only 25% of the class 0 input is kept and class 1 is left unchanged in Python?

Question

I have an imbalanced dataset in Python, for example 95% zeros and 5% ones.

How can I undersample to reduce the number of zeros so that only 25% of the zeros from the input dataset remain?

I am asking because the sources I found online only show undersampling code that balances the dataset to 50% zeros and 50% ones. I do not want that; I only want to reduce the number of zeros to 25% of what is in the dataset.

How can I do that in Python? Do you have some example code?

Answer 1

To apply different rules to different values, you can use groupby. As you didn't give an example dataset, I'm just using a dataframe with a column col that contains 19 zeros and 1 one:

>>> df.shape
(20, 2)
>>> df['col'].value_counts() / len(df)
0      0.95
1      0.05
Name: col, dtype: float64
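The answer does not show how df was built. A minimal sketch that reproduces the shape and proportions above, assuming (from the sampled output further down) that the second column foo simply holds the letters a to t:

```python
import string

import pandas as pd

# Assumed reconstruction of the example data: 19 rows of class 0, 1 row of class 1.
# 'foo' is filler so that the sampled rows are easy to tell apart.
df = pd.DataFrame({
    "col": [0] * 19 + [1],
    "foo": list(string.ascii_lowercase[:20]),
})

print(df.shape)
print(df["col"].value_counts() / len(df))
```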

groupby.sample() doesn't allow setting different numbers or fractions per group, so we can simply use groupby.apply(), which can call sample() on each group's dataframe with its own fraction:

>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col            
0   6     0   g
    16    0   q
    3     0   d
    14    0   o
    15    0   p
1   19    1   t
>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col            
0   16    0   q
    5     0   f
    13    0   n
    2     0   c
    9     0   j
1   19    1   t

Note that this relies on the fact that the value identifying the group is passed into apply: it is set as a .name attribute on each group's dataframe.
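If you would rather not depend on the .name attribute, one possible variant is to iterate over the groups yourself and concatenate the sampled pieces. This is only a sketch: the keep_frac mapping and the fixed random_state are my own illustrative choices, and df is the assumed example dataframe with 19 zeros and 1 one.

```python
import string

import pandas as pd

# Assumed example data: 19 rows of class 0 and 1 row of class 1.
df = pd.DataFrame({
    "col": [0] * 19 + [1],
    "foo": list(string.ascii_lowercase[:20]),
})

# Hypothetical mapping: fraction of rows to keep per class.
# Classes that are not listed are kept in full.
keep_frac = {0: 0.25}

# Iterating over a groupby yields (group key, sub-dataframe) pairs,
# so the key is available directly without going through g.name.
parts = [
    g.sample(frac=keep_frac.get(key, 1.0), random_state=42)
    for key, g in df.groupby("col")
]
undersampled = pd.concat(parts)

print(undersampled["col"].value_counts())
```

The result keeps the original row labels in a single-level index. Passing random_state makes the draw repeatable; leave it out to get a different sample on each call, as in the two runs shown above.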

You can add .droplevel('col') at the end to remove the first index level.
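A short sketch of what droplevel changes, again on the assumed 19-zeros/1-one dataframe. The random_state is added only to make the run repeatable. Depending on your pandas version, groupby.apply may warn about, or leave out, the grouping column in the result, so treat this as illustrative rather than definitive.

```python
import string

import pandas as pd

# Assumed example data: 19 rows of class 0 and 1 row of class 1.
df = pd.DataFrame({
    "col": [0] * 19 + [1],
    "foo": list(string.ascii_lowercase[:20]),
})

sampled = df.groupby("col").apply(
    lambda g: g.sample(frac=0.25 if g.name == 0 else 1, random_state=0)
)
# The index has two levels: the group key 'col' and the original row label.
print(sampled.index.nlevels)

# Dropping the 'col' level leaves only the original row labels.
flat = sampled.droplevel("col")
print(flat.index.nlevels)
print(len(flat))
```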
