How to undersample category 0 to 25% of its input rows without changing category 1 in Python?
I have an imbalanced dataset in Python, for example 95% zeros and 5% ones.
How can I undersample it so that only 25% of the zeros from the input dataset are kept?
I'm asking because the undersampling code I find online always balances the dataset to 50% zeros and 50% ones, and I don't want that. I only want to reduce the number of zeros to 25% of what is in the dataset.
How can I do that in Python? Do you have some example code?
To apply different rules to different values, you can use groupby. As you didn't give an example dataset, I'm just using a dataframe with a column col, which has 19 zeros and 1 one:
>>> df.shape
(20, 2)
>>> df['col'].value_counts() / len(df)
0 0.95
1 0.05
Name: col, dtype: float64
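The answer does not show how df was built. A minimal sketch that reproduces the shape and class proportions above, assuming foo simply holds the letters a to t (this is inferred from the sampled rows in the answer's output, not stated in the original):

```python
import string

import pandas as pd

# Assumed reconstruction of the example frame: 19 zeros followed by a single one.
# The 'foo' values (letters a..t) are a guess based on the sampled output.
df = pd.DataFrame({
    "col": [0] * 19 + [1],
    "foo": list(string.ascii_lowercase[:20]),
})

print(df.shape)
print(df["col"].value_counts() / len(df))
```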
Now groupby.sample() doesn't allow setting different numbers or fractions per group, so we can simply use groupby.apply(), which can itself call sample() on each group's dataframe:
>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col
0   6     0   g
    16    0   q
    3     0   d
    14    0   o
    15    0   p
1   19    1   t
>>> df.groupby('col').apply(lambda g: g.sample(frac=.25 if g.name == 0 else 1))
        col foo
col
0   16    0   q
    5     0   f
    13    0   n
    2     0   c
    9     0   j
1   19    1   t
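The two runs above return different zeros because sample() draws at random on every call. If a reproducible result is wanted, one possible alternative (not the groupby approach from this answer) is to sample the zeros through a boolean mask and concatenate them with the untouched ones. The helper name undersample_value and the random_state value are made up for this sketch, and the frame is the assumed 19-zeros/1-one example:

```python
import pandas as pd


def undersample_value(df, column, value, frac, random_state=None):
    """Keep a random `frac` of the rows where df[column] == value; keep all other rows."""
    mask = df[column] == value
    kept = df[mask].sample(frac=frac, random_state=random_state)
    # sort_index() restores the original row order after concatenation.
    return pd.concat([kept, df[~mask]]).sort_index()


# Assumed example data: 19 zeros and 1 one.
df = pd.DataFrame({"col": [0] * 19 + [1], "foo": list("abcdefghijklmnopqrst")})

result = undersample_value(df, "col", value=0, frac=0.25, random_state=42)
print(result["col"].value_counts())
```

With 19 zeros, frac=0.25 keeps 5 of them, since pandas rounds 4.75 to the nearest whole row, so the result has 5 zeros and 1 one, matching the row counts in the output above.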
Note that I'm relying on the fact that the value identifying each group is passed into apply by setting a .name attribute on the group's dataframe.
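If relying on the .name attribute feels too implicit, iterating over the groupby object yields each key together with its sub-frame. The sketch below uses that to look up a per-class fraction in a dictionary; the fractions mapping, the default of 1.0 for unlisted classes, and the seed are assumptions for illustration:

```python
import pandas as pd

# Assumed example data: 19 zeros and 1 one.
df = pd.DataFrame({"col": [0] * 19 + [1], "foo": list("abcdefghijklmnopqrst")})

# Hypothetical per-class sampling fractions; classes not listed are kept in full.
fractions = {0: 0.25}

parts = [
    group.sample(frac=fractions.get(key, 1.0), random_state=0)
    for key, group in df.groupby("col")
]
sampled = pd.concat(parts)

print(sampled["col"].value_counts())
```

Because the pieces are concatenated directly, the result keeps the original flat index and no extra index level is added.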
You can add .droplevel('col') at the end to remove the first index level.