Replace the Outlier in a group with the mean of the group in a pandas series
In the following DataFrame I want to replace the outliers in the EMI column with the mode of the group. Here's sample data.
Id | C_Id | EMI |
---|---|---|
1 | 1000 | 141 |
2 | 1000 | 141 |
3 | 1000 | 21538 |
4 | 2000 | 313 |
5 | 2000 | 313 |
6 | 2000 | 31528 |
7 | 3000 | 0 |
8 | 3000 | 0 |
9 | 3000 | 3000 |
10 | 3000 | 4000 |
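To make the question reproducible, the sample data above can be built as a DataFrame. This is a minimal sketch; the variable name `df` is taken from the answer's code, and the integer dtypes are an assumption.

```python
import pandas as pd

# Sample data from the table above; integer columns are assumed.
df = pd.DataFrame({
    "Id": list(range(1, 11)),
    "C_Id": [1000] * 3 + [2000] * 3 + [3000] * 4,
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})
print(df)
```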
I am expecting the output to be like this.
Id | C_Id | EMI |
---|---|---|
1 | 1000 | 141 |
2 | 1000 | 141 |
3 | 1000 | 141 |
4 | 2000 | 313 |
5 | 2000 | 313 |
6 | 2000 | 313 |
7 | 3000 | 0 |
8 | 3000 | 0 |
9 | 3000 | 0 |
10 | 3000 | 0 |
The first step is to calculate the mode of each group:
from scipy import stats
# stats.mode(x)[0] is the most frequent EMI value within each C_Id group
modes = df.groupby('C_Id').agg({'EMI': lambda x: stats.mode(x)[0]}).reset_index()
modes
This gives you:
| | C_Id | EMI |
|---|---|---|
| 0 | 1000 | 141 |
| 1 | 2000 | 313 |
| 2 | 3000 | 0 |
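If SciPy is not available, the same per-group modes can probably be obtained with pandas alone. The sketch below is an alternative, not the method used in this answer. It relies on `Series.mode()`, which returns all modes sorted in ascending order, so `.iloc[0]` picks the smallest one when a group has ties (that tie-breaking rule is a choice made here, not something the question specifies).

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Id": list(range(1, 11)),
    "C_Id": [1000] * 3 + [2000] * 3 + [3000] * 4,
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})

# Per-group mode using pandas only; on ties, the smallest mode is taken.
modes = (
    df.groupby("C_Id")["EMI"]
      .agg(lambda s: s.mode().iloc[0])
      .reset_index()
)
print(modes)
```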
Then it depends on your definition of "outlier". If you simply mean that an outlier is any value different from the mode, it's simply:
df.drop(columns = ['EMI']).merge(modes, on=['C_Id'])
| | Id | C_Id | EMI |
|---|---|---|---|
| 0 | 1 | 1000 | 141 |
| 1 | 2 | 1000 | 141 |
| 2 | 3 | 1000 | 141 |
| 3 | 4 | 2000 | 313 |
| 4 | 5 | 2000 | 313 |
| 5 | 6 | 2000 | 313 |
| 6 | 7 | 3000 | 0 |
| 7 | 8 | 3000 | 0 |
| 8 | 9 | 3000 | 0 |
| 9 | 10 | 3000 | 0 |
However, if you have a specific criterion for what counts as an outlier, you can do:
merged = df.merge(modes, on=['C_Id'], suffixes=['', '_y'])
merged['replacement'] = merged.EMI.gt(merged.EMI_y*10) # use your criteria of outlier here
merged.loc[merged.replacement,'EMI'] = merged.loc[merged.replacement,'EMI_y']
merged.drop(columns=['EMI_y', 'replacement'])
This still gives the same output for your example use case, but the comparison is now based on the criterion you set:
| | Id | C_Id | EMI |
|---|---|---|---|
| 0 | 1 | 1000 | 141 |
| 1 | 2 | 1000 | 141 |
| 2 | 3 | 1000 | 141 |
| 3 | 4 | 2000 | 313 |
| 4 | 5 | 2000 | 313 |
| 5 | 6 | 2000 | 313 |
| 6 | 7 | 3000 | 0 |
| 7 | 8 | 3000 | 0 |
| 8 | 9 | 3000 | 0 |
| 9 | 10 | 3000 | 0 |
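The criterion-based replacement can also be sketched without a merge, by broadcasting each group's mode back onto the rows with `groupby(...).transform` and then using `Series.mask`. This is an illustrative variant under the same "greater than 10 times the mode" rule used above. The names `group_mode`, `is_outlier` and `out` are made up for this example.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "Id": list(range(1, 11)),
    "C_Id": [1000] * 3 + [2000] * 3 + [3000] * 4,
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})

# Mode of each C_Id group, repeated for every row of that group.
group_mode = df.groupby("C_Id")["EMI"].transform(lambda s: s.mode().iloc[0])

# Same criterion as above: a value is an outlier if it exceeds 10x the mode.
is_outlier = df["EMI"] > group_mode * 10

# Replace only the flagged rows with their group's mode.
out = df.assign(EMI=df["EMI"].mask(is_outlier, group_mode))
print(out)
```

Note that with this particular criterion, a group whose mode is 0 (such as `C_Id` 3000) flags every positive value as an outlier, so a different rule may suit real data better.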