I have the following function that removes outliers, but I want to replace them with the mean value of the same column instead:
def remove_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
    return df_out
Let's try this. Identify the outliers based on your criteria, then assign to them the mean of the values in that column that are not outliers.
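As a minimal sketch of the first step only, the outlier test can be written as a boolean mask. The values in `s` are hypothetical and chosen so the result is reproducible; the fences are the same 1.5 * IQR fences used in the question.

```python
import pandas as pd

# Hypothetical fixed values: eight ordinary points plus two extremes
s = pd.Series([-5.0, 1.0, -1.0, -1.5, 1.5, 0.0, -1.0, 0.0, -0.5, 5.0])

q1 = s.quantile(0.25)
q3 = s.quantile(0.75)
iqr = q3 - q1  # Interquartile range
fence_low = q1 - 1.5 * iqr
fence_high = q3 + 1.5 * iqr

# True where a value lies on or outside the fences
is_outlier = (s <= fence_low) | (s >= fence_high)
print(is_outlier.tolist())
```

Here only the first and last values are flagged, so the mask can be used both to select the outliers and, negated, to select the values that feed the mean.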
With some test data:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': range(10), 'b': np.random.randn(10)})
# These will be our two outlier points
df.iloc[0] = -5
df.iloc[9] = 5
>>> df
   a         b
0 -5 -5.000000
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5  5.000000
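def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3 - q1  # Interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    # Copy the argument (not the global df) so the input is left untouched
    df_out = df_in.copy()
    # inclusive='neither' requires pandas >= 1.3; older versions used inclusive=False
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')
    # Fill outliers with the mean of the non-outlier values
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out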
>>> replace_outlier(df, 'b')
   a         b
0 -5 -0.106019
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5 -0.106019
We can check that the fill value equals the mean of the other (non-outlier) values in the column:
>>> df.iloc[1:9]['b'].mean()
-0.10601866399896176
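Because the test data above is random, the numbers will differ on each run. The following is a sketch of the same idea on fixed data, extended to several columns and using `Series.mask` in place of `.loc` assignment. The function name `replace_outliers_with_mean` and the frame `demo` are hypothetical and not part of the original question.

```python
import pandas as pd

def replace_outliers_with_mean(df_in, col_names):
    """Return a copy of df_in in which IQR outliers in each listed column
    are replaced by the mean of that column's non-outlier values."""
    df_out = df_in.copy()
    for col in col_names:
        s = df_out[col]
        q1, q3 = s.quantile(0.25), s.quantile(0.75)
        iqr = q3 - q1  # Interquartile range
        # Values on or outside the 1.5 * IQR fences count as outliers
        is_outlier = (s <= q1 - 1.5 * iqr) | (s >= q3 + 1.5 * iqr)
        # Series.mask replaces the values where the condition is True
        df_out[col] = s.mask(is_outlier, s[~is_outlier].mean())
    return df_out

# Hypothetical fixed data: 'x' has two outliers (-5 and 5), 'y' has one (100)
demo = pd.DataFrame({
    'x': [-5.0, 1.0, -1.0, -1.5, 1.5, 0.0, -1.0, 0.0, -0.5, 5.0],
    'y': [10.0, 11.0, 12.0, 11.5, 10.5, 11.0, 12.0, 100.0, 10.0, 11.0],
})
result = replace_outliers_with_mean(demo, ['x', 'y'])
print(result)
```

With this data the two outliers in `x` become -0.1875 and the outlier in `y` becomes 11.0, which are the means of the remaining values in each column. Note that replacing values in an integer column with a mean will convert that column to float.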