Replace outlier with mean value

Question

I have the following function that removes outliers, but I want to replace them with the mean value of the same column instead:

        def remove_outlier(df_in, col_name):
            q1 = df_in[col_name].quantile(0.25)
            q3 = df_in[col_name].quantile(0.75)
            iqr = q3-q1 #Interquartile range
            fence_low  = q1-1.5*iqr
            fence_high = q3+1.5*iqr
            df_out = df_in.loc[(df_in[col_name] > fence_low) & (df_in[col_name] < fence_high)]
            return df_out

Answer 1

Let's try this. Identify the outliers based on your criteria, then directly assign to them the mean of the records in that column that are not outliers.

With some test data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'a': range(10), 'b': np.random.randn(10)})

# These will be our two outlier points
df.iloc[0] = -5
df.iloc[9] = 5

>>> df
   a         b
0 -5 -5.000000
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5  5.000000

def replace_outlier(df_in, col_name):
    q1 = df_in[col_name].quantile(0.25)
    q3 = df_in[col_name].quantile(0.75)
    iqr = q3-q1 #Interquartile range
    fence_low  = q1-1.5*iqr
    fence_high = q3+1.5*iqr
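    df_out = df_in.copy()  # copy the argument, not the global df
    # pandas >= 1.3 expects a string here; older versions used inclusive=False
    outliers = ~df_out[col_name].between(fence_low, fence_high, inclusive='neither')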
    df_out.loc[outliers, col_name] = df_out.loc[~outliers, col_name].mean()
    return df_out

>>> replace_outlier(df, 'b')

   a         b
0 -5 -0.106019
1  1  1.375111
2  2 -1.004325
3  3 -1.326068
4  4  1.689807
5  5 -0.181405
6  6 -1.016909
7  7 -0.039639
8  8 -0.344721
9  5 -0.106019

We can check that the fill value is equal to the mean of the remaining (non-outlier) values in the column:

>>> df.iloc[1:9]['b'].mean()
-0.10601866399896176
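
If you only need to fix a single column, the same idea can be sketched with `Series.where`, which keeps the values where a condition holds and substitutes a fill value elsewhere. This is only a minimal sketch: the function name `replace_outlier_where` and the fixed test values are made up for illustration, and the fences use the same 1.5 * IQR rule as above. Fixed data is used instead of `np.random.randn` so the result is reproducible.

```python
import pandas as pd


def replace_outlier_where(series):
    """Return a copy of `series` with IQR outliers replaced by the mean of the inliers.

    Hypothetical helper for illustration; not part of pandas.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1  # interquartile range
    fence_low = q1 - 1.5 * iqr
    fence_high = q3 + 1.5 * iqr
    # True for values strictly inside the fences (same test as in the question)
    inside = (series > fence_low) & (series < fence_high)
    # Keep inliers, replace everything else with the mean of the inliers only
    return series.where(inside, series[inside].mean())


# Fixed example data: the first and last values are the outliers
s = pd.Series([-50.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 50.0])
result = replace_outlier_where(s)
print(result)
```

Here the fences come out as -4.5 and 13.5, so the first and last values are replaced by 4.5, the mean of the eight values in between. Comparing with `>` and `<` directly also sidesteps the `inclusive` argument of `between`, whose accepted values differ between pandas versions.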
