Replace the outlier in a group with the mode of the group in a pandas series

In the following dataframe I want to replace the outliers in the EMI column with the mode of the group. Here's sample data.

Id C_Id EMI
1 1000 141
2 1000 141
3 1000 21538
4 2000 313
5 2000 313
6 2000 31528
7 3000 0
8 3000 0
9 3000 3000
10 3000 4000

I am expecting the output to be like this.

Id C_Id EMI
1 1000 141
2 1000 141
3 1000 141
4 2000 313
5 2000 313
6 2000 313
7 3000 0
8 3000 0
9 3000 0
10 3000 0

The first step is to calculate the mode of EMI for each C_Id group:

from scipy import stats
# stats.mode returns a (mode, count) result; [0] picks the mode
modes = df.groupby('C_Id').agg({'EMI':lambda x:stats.mode(x)[0]}).reset_index()
modes

Which will give you:

C_Id EMI
0 1000 141
1 2000 313
2 3000 0
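If you would rather not depend on SciPy, the same per-group modes can probably be obtained with pandas alone. The following is a minimal sketch, not the answer's original method: it rebuilds the sample data from the question and assumes that, when a group has several modes, taking the smallest one (the first value returned by `Series.mode`) is acceptable.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Id": range(1, 11),
    "C_Id": [1000] * 3 + [2000] * 3 + [3000] * 4,
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})

# Series.mode returns all modes sorted ascending; .iloc[0] keeps the smallest
# (assumption: that tie-breaking rule suits your data)
modes = (
    df.groupby("C_Id")["EMI"]
      .agg(lambda s: s.mode().iloc[0])
      .reset_index()
)
print(modes)
```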

What comes next depends on your definition of "outlier". If an outlier is simply any value that differs from the group mode, drop the original EMI column and merge the modes back in:

df.drop(columns = ['EMI']).merge(modes, on=['C_Id'])
Id C_Id EMI
0 1 1000 141
1 2 1000 141
2 3 1000 141
3 4 2000 313
4 5 2000 313
5 6 2000 313
6 7 3000 0
7 8 3000 0
8 9 3000 0
9 10 3000 0
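The "replace everything with the group mode" case can also be written without a merge, using `groupby(...).transform`, which broadcasts one value per group back onto the original rows. This is only a sketch under the same assumptions as the question's sample data; the name `out` is illustrative.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Id": range(1, 11),
    "C_Id": [1000] * 3 + [2000] * 3 + [3000] * 4,
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})

# transform keeps the original index and row order, so no merge is needed
out = df.copy()
out["EMI"] = df.groupby("C_Id")["EMI"].transform(lambda s: s.mode().iloc[0])
print(out)
```

Because `transform` preserves the original index, this variant keeps the rows in their original order even if the data is not sorted by C_Id.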

However, if you have a specific criterion for what counts as an outlier, you can do:

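# attach each group's mode as EMI_y next to the original EMI
merged = df.merge(modes, on=['C_Id'], suffixes=['', '_y'])
# flag the rows to replace; use your own outlier criterion here
merged['replacement'] = merged.EMI.gt(merged.EMI_y*10)
# overwrite only the flagged rows with the group mode
merged.loc[merged.replacement,'EMI'] = merged.loc[merged.replacement,'EMI_y']
# drop the helper columns (this returns a new frame; assign it if you want to keep it)
merged.drop(columns=['EMI_y', 'replacement'])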

This still gives the same output for your example, but the replacement is now driven by the criterion you set:

Id C_Id EMI
0 1 1000 141
1 2 1000 141
2 3 1000 141
3 4 2000 313
4 5 2000 313
5 6 2000 313
6 7 3000 0
7 8 3000 0
8 9 3000 0
9 10 3000 0
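The criterion-based replacement can be sketched more compactly with `transform` and `Series.mask`, avoiding the helper columns. The example below reuses the "more than 10 times the group mode" rule from the snippet above purely as an illustration; `group_mode`, `is_outlier` and `out` are illustrative names, and you would substitute your own rule.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Id": range(1, 11),
    "C_Id": [1000] * 3 + [2000] * 3 + [3000] * 4,
    "EMI": [141, 141, 21538, 313, 313, 31528, 0, 0, 3000, 4000],
})

# Mode of each group, aligned row by row with df
group_mode = df.groupby("C_Id")["EMI"].transform(lambda s: s.mode().iloc[0])

# Illustrative criterion: a value is an outlier if it exceeds 10x the group mode
is_outlier = df["EMI"] > group_mode * 10

# mask replaces values where the condition is True and leaves the rest untouched
out = df.assign(EMI=df["EMI"].mask(is_outlier, group_mode))
print(out)
```

Note that with this particular rule any positive value in a group whose mode is 0 is flagged, since 10 times 0 is still 0; that is why rows 9 and 10 are replaced here.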

