
How to undersample so that only 25% of category 0 is kept and category 1 is left unchanged in Python?

I have an imbalanced dataset in Python, for example 95% zeros and 5% ones.

How can I undersample to reduce the number of zeros so that only 25% of the zeros in the input dataset remain?

I ask because the undersampling code I find online always balances the dataset to 50% zeros and 50% ones, and I do not want that. I only want to reduce the number of zeros to 25% of their original count.

How can I do that in Python? Do you have some example code?

To apply different rules to different values, you can use groupby. As you didn't give an example dataset, I'm using a dataframe with a column col that has 19 zeros and 1 one:

>>> df.shape
(20, 2)
>>> df['col'].value_counts() / len(df)
0      0.95
1      0.05
Name: col, dtype: float64
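
For anyone who wants to reproduce this, here is a minimal sketch of how such a dataframe could be built. The contents of the foo column (the letters a to t, one per row) are an assumption inferred from the sample output further down, since the original construction code is not shown.

```python
import string

import pandas as pd

# 19 rows of category 0 followed by a single row of category 1.
# The foo values are assumed: one lowercase letter per row, a..t.
df = pd.DataFrame({
    "col": [0] * 19 + [1],
    "foo": list(string.ascii_lowercase[:20]),
})

# Share of each category in the whole dataframe.
proportions = df["col"].value_counts() / len(df)

print(df.shape)
print(proportions)
```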

groupby.sample() doesn't allow setting different numbers or fractions per group, so we can use groupby.apply(), which can call sample() on each group's dataframe:

>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col            
0   6     0   g
    16    0   q
    3     0   d
    14    0   o
    15    0   p
1   19    1   t
>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col            
0   16    0   q
    5     0   f
    13    0   n
    2     0   c
    9     0   j
1   19    1   t
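
If you would rather not go through groupby at all, the same result can be sketched with a boolean mask and pd.concat. The helper name undersample and its parameters are hypothetical, chosen only for this illustration, and the dataframe is rebuilt with assumed foo values so the snippet runs on its own. Passing a seed makes the draw repeatable, unlike the two calls above, which return different rows each time.

```python
import pandas as pd


def undersample(df, column, value, frac, seed=None):
    """Keep a random `frac` of the rows where df[column] == value.

    All other rows are kept unchanged. (Hypothetical helper for illustration.)
    """
    mask = df[column] == value
    kept = df[mask].sample(frac=frac, random_state=seed)
    # Put the sampled rows back together with the untouched rows,
    # restoring the original row order.
    return pd.concat([kept, df[~mask]]).sort_index()


# Assumed example data: 19 zeros, 1 one, letters a..t in foo.
df = pd.DataFrame({"col": [0] * 19 + [1], "foo": list("abcdefghijklmnopqrst")})

reduced = undersample(df, "col", 0, 0.25, seed=0)
print(reduced["col"].value_counts())
```

Note that this keeps 25% of the zeros (5 of the 19 here, because sample() rounds 4.75 up to 5); it does not make zeros 25% of the resulting dataset.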

Note that this relies on apply passing along the value that identifies each group by setting a .name attribute on the group's dataframe.

You can add .droplevel('col') at the end to remove the first index level.
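
As a rough sketch of what that looks like, assuming the same example data as above (the foo values and the random_state are my own additions): the result of apply has a two-level index, and droplevel('col') leaves only the original row labels. Recent pandas versions may emit a warning about apply operating on the grouping column, so treat this as illustrative rather than version-proof.

```python
import pandas as pd

# Assumed example data: 19 zeros, 1 one, letters a..t in foo.
df = pd.DataFrame({"col": [0] * 19 + [1], "foo": list("abcdefghijklmnopqrst")})

# Sample 25% of group 0 and all of group 1; the result is indexed by
# (group key, original row label).
sampled = df.groupby("col").apply(
    lambda g: g.sample(frac=0.25 if g.name == 0 else 1, random_state=0)
)

# Drop the group-key level so only the original row labels remain.
flat = sampled.droplevel("col")

print(sampled.index.nlevels, flat.index.nlevels)
```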
