
Replace the Outlier in a group with the mean of the group in a pandas series

In the following DataFrame I want to replace the outliers in the EMI column with the mode of the group. Here's the sample data.

Id  C_Id  EMI
1   1000  141
2   1000  141
3   1000  21538
4   2000  313
5   2000  313
6   2000  31528
7   3000  0
8   3000  0
9   3000  3000
10  3000  4000
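
To make the snippets in this thread reproducible, the sample data can be loaded into a DataFrame. This is only a sketch: the variable name `df` matches what the answer uses, and the integer dtypes are an assumption.

```python
import pandas as pd

# Sample data copied from the table above; integer dtypes are assumed
df = pd.DataFrame({
    "Id": range(1, 11),
    "C_Id": [1000, 1000, 1000, 2000, 2000, 2000, 3000, 3000, 3000, 3000],
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})
```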

I am expecting the output to look like this.

Id  C_Id  EMI
1   1000  141
2   1000  141
3   1000  141
4   2000  313
5   2000  313
6   2000  313
7   3000  0
8   3000  0
9   3000  0
10  3000  0

The first step is to calculate the mode of each group:

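from scipy import stats

# Most frequent EMI value within each C_Id group;
# [0] selects the mode from the (mode, count) result
modes = df.groupby('C_Id').agg({'EMI': lambda x: stats.mode(x)[0]}).reset_index()
modes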

This gives you:

   C_Id  EMI
0  1000  141
1  2000  313
2  3000    0
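
If you would rather not depend on SciPy, the per-group mode can also be computed with pandas alone. The sketch below is an alternative to the answer's approach, not the author's code. It rebuilds the sample `df` so it runs on its own, and it assumes that taking the smallest value is acceptable when a group has several modes.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": range(1, 11),
    "C_Id": [1000, 1000, 1000, 2000, 2000, 2000, 3000, 3000, 3000, 3000],
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})

# Series.mode() returns every mode in ascending order;
# .iloc[0] keeps the smallest one when there are ties
modes = (
    df.groupby("C_Id")["EMI"]
      .agg(lambda s: s.mode().iloc[0])
      .reset_index()
)
```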

Then it depends on your definition of "outlier". If you simply mean that an outlier is any value different from the group's mode, it's just:

df.drop(columns=['EMI']).merge(modes, on=['C_Id'])

   Id  C_Id  EMI
0   1  1000  141
1   2  1000  141
2   3  1000  141
3   4  2000  313
4   5  2000  313
5   6  2000  313
6   7  3000    0
7   8  3000    0
8   9  3000    0
9  10  3000    0
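
The drop-and-merge step can also be expressed with `groupby(...).transform`, which broadcasts each group's mode back onto the original rows and keeps the original index and column order. This is a sketch of an alternative, not the answer's own method. The name `result` is made up for illustration, and the sample `df` is rebuilt so the block is self-contained.

```python
import pandas as pd

df = pd.DataFrame({
    "Id": range(1, 11),
    "C_Id": [1000, 1000, 1000, 2000, 2000, 2000, 3000, 3000, 3000, 3000],
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})

# transform() repeats the scalar returned for each group across that group's rows
result = df.copy()
result["EMI"] = df.groupby("C_Id")["EMI"].transform(lambda s: s.mode().iloc[0])
```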

However, if you have a specific criterion for what counts as an outlier, you can do:

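# Attach each group's mode as EMI_y next to the original EMI
merged = df.merge(modes, on='C_Id', suffixes=('', '_y'))
# Flag outliers; replace this test with your own criterion
merged['replacement'] = merged.EMI.gt(merged.EMI_y * 10)
# Overwrite only the flagged values with the group mode
merged.loc[merged.replacement, 'EMI'] = merged.loc[merged.replacement, 'EMI_y']
merged.drop(columns=['EMI_y', 'replacement'])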

This still gives the same output for your example, but the comparison is now based on the criterion you set:

   Id  C_Id  EMI
0   1  1000  141
1   2  1000  141
2   3  1000  141
3   4  2000  313
4   5  2000  313
5   6  2000  313
6   7  3000    0
7   8  3000    0
8   9  3000    0
9  10  3000    0
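
The same "greater than ten times the mode" criterion can be written without the helper columns by combining `transform` with `Series.mask`. The function name `replace_outliers_with_mode` and its `factor` parameter are hypothetical, introduced only for this sketch. Swap the comparison for whatever outlier test suits your data.

```python
import pandas as pd


def replace_outliers_with_mode(df, factor=10):
    """Hypothetical helper: replace EMI values above factor * group mode."""
    # Mode of each C_Id group, aligned row by row with df
    mode_per_row = df.groupby("C_Id")["EMI"].transform(lambda s: s.mode().iloc[0])
    # Assumed criterion, mirroring the answer: value exceeds factor times the mode
    is_outlier = df["EMI"] > mode_per_row * factor
    # mask() swaps in the group mode wherever is_outlier is True
    return df.assign(EMI=df["EMI"].mask(is_outlier, mode_per_row))


df = pd.DataFrame({
    "Id": range(1, 11),
    "C_Id": [1000, 1000, 1000, 2000, 2000, 2000, 3000, 3000, 3000, 3000],
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})
cleaned = replace_outliers_with_mode(df)
```

Because the mode of group 3000 is 0, any positive value in that group exceeds ten times the mode, so 3000 and 4000 are both replaced. A criterion that multiplies by the mode may need special handling when the mode is zero.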
