简体   繁体   中英

Replace outlier with mean value

I have the following function that will remove the outlier but I want to replace them with mean value in the same column

        def remove_outlier(df_in, col_name):
        q1 = df_in[col_name].quantile(0.25)
        q3 = df_in[col_name].quantile(0.75)
        iqr = q3-q1 #Interquartile range
        fence_low  = q1-1.5*iqr
        fence_high = q3+1.5*iqr
        df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
        return df_out

Let's try this. Identify the outliers based on your criteria, then directly assign the mean of the column to them for those records that are not outliers.

With some test data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': range(10), 'b': np.random.randn(10)})

# These will be our two outlier points
df.iloc[0] = -5
df.iloc[9] = 5

>>> df
   a         b
0 -5 -5.000000
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5  5.000000

def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
    df_out = df.copy()
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive=False)
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out

>>> replace_outlier(df, 'b')

   a         b
0 -5 -0.106019
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5 -0.106019

We can check that the fill value is equal to the mean for all of the other column values:

>>> df.iloc[1:9]['b'].mean()
-0.10601866399896176

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM