
Remove outliers in a DataFrame in Python

I would like to remove outliers from a DataFrame using the mean and standard deviation in Python. However, instead of simply deleting the outliers, I want to replace them with NaN and then store the result as a DataFrame again. That is my question.

I came up with the code below, but I do not know how to continue from here. Any approach that solves the problem is fine; it does not have to follow the code below.

sigma = 2  # assumed threshold; the original snippet did not define sigma

df_group = df.groupby('count')
df_group_mean = df_group.mean()
df_group_std = df_group.std()
index_list = df_group_mean.index
col_list = ["A", "B", "C", "D"]

for IndexList in index_list:
    # rows of this group, selected by key (df.iloc would select by position)
    in_group = df['count'] == IndexList

    for ColList in col_list:
        mean = df_group_mean.loc[IndexList, ColList]
        std = df_group_std.loc[IndexList, ColList]
        # mark this group's values outside mean +/- sigma * std and set them to NaN in df
        outlier = (df[ColList] > mean + (std * sigma)) | (df[ColList] < mean - (std * sigma))
        df[ColList] = df[ColList].mask(in_group & outlier)

You probably need something like this:

import pandas as pd
import numpy as np

df = pd.DataFrame({'x':[-30,-2,0,1,2,4,5,7,8,9,10,10,34]})

Treat values that lie more than 2 standard deviations above or below the mean as outliers. In this example the first and last values are turned into NaN.

mean, std = df['x'].mean(), df['x'].std()
df['x'] = df['x'].mask((df['x'] > mean + 2*std) | (df['x'] < mean - 2*std))
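
The same idea can probably be extended to the grouped case from the question. The following is only a sketch, assuming 'count' is the grouping column and the value columns are numeric; the helper name mask_group_outliers and the example data are made up for illustration and do not come from the question.

```python
import numpy as np
import pandas as pd


def mask_group_outliers(df, group_col, value_cols, sigma=2.0):
    """Return a copy of df in which values lying more than `sigma`
    standard deviations from their group mean are replaced with NaN."""
    out = df.copy()
    grouped = df.groupby(group_col)[value_cols]
    # transform() broadcasts each group's statistic back to the original rows
    mean = grouped.transform("mean")
    std = grouped.transform("std")  # sample standard deviation (ddof=1)
    is_outlier = (df[value_cols] - mean).abs() > sigma * std
    out[value_cols] = df[value_cols].mask(is_outlier)
    return out


# Hypothetical data: two groups, each with one extreme value in column "A"
example = pd.DataFrame({
    "count": [1] * 7 + [2] * 7,
    "A": [1, 2, 3, 2, 1, 2, 100, 10, 11, 12, 11, 10, 11, -500],
    "B": [5, 6, 5, 6, 5, 6, 5, 7, 8, 7, 8, 7, 8, 7],
})

result = mask_group_outliers(example, "count", ["A", "B"], sigma=2)
print(result)
```

Because the function returns a new DataFrame, the original data stays untouched and the result can be saved or processed further. Note that a group with a single row has an undefined standard deviation, so nothing in that group is marked as an outlier.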
