
How to find outlier frames in a MultiIndex DataFrame

The result should be a MultiIndex DataFrame that does not contain any outliers. The criterion is based on the standard deviation: a value x is kept if np.abs(x-g_mean) <= 3*g_std

My attempt to identify the statistical outliers:

import pandas as pd
import numpy as np

#create sample
arrays = [[1,1,1,2,2,2,3,3],
          [0,1,2,0,1,2,0,1]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['ID', 'INDEX'])
df = pd.DataFrame(np.abs(np.random.randn(8, 2)), index=index, columns=['Ts','Tf'])

#groupby index and learn from data
g = df.groupby(level='INDEX')
g_mean=g.mean()
g_std = g.std()

#groupby ID and look if some ID is an outlier
g = df.groupby(level='ID')
test = g.apply(lambda x: True if np.abs(x-g_mean) <= 3*g_std else False)

The last line in my code does not work, because inside each group I compare DataFrames of different shapes. Any suggestions?
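Independently of the shape mismatch, the expression `True if <DataFrame> else False` cannot work, because pandas refuses to collapse an element-wise comparison into a single truth value. The small sketch below illustrates this with a hypothetical boolean frame named `mask` (not part of the original code) and shows the explicit reduction with `.all()` that is needed instead.

```python
import pandas as pd

# Hypothetical element-wise comparison result, standing in for
# np.abs(x - g_mean) <= 3 * g_std
mask = pd.DataFrame({"Ts": [True, True], "Tf": [True, False]})

error_message = None
try:
    # A DataFrame has no single truth value, so this raises ValueError
    result = True if mask else False
except ValueError as exc:
    error_message = str(exc)

# Reduce explicitly: True only if every cell is True
all_within = bool(mask.to_numpy().all())
```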

You can use:

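#df.mean(level=...) and df.std(level=...) were removed in pandas 2.0, so group by the level instead
g_mean = df.groupby(level='INDEX').mean()
g_std = df.groupby(level='INDEX').std()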

def f(x):
    #remove first level per group
    x = x.reset_index(level=0, drop=True)
    #flag values within 3 standard deviations and check that all of them are True
    m = (np.abs(x-g_mean) <= 3*g_std).values.all()
    return m

#groupby ID and look if some ID is an outlier
s = df.groupby(level='ID').apply(f)
print (s)
ID
1     True
2     True
3    False
dtype: bool
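Note that in this sample ID 3 has no row for INDEX 2. Subtracting `g_mean` aligns on the index, so the missing row becomes NaN, and any comparison with NaN is False. An ID can therefore be rejected for being incomplete rather than for containing an extreme value. The following sketch uses made-up deterministic values (an assumption, in place of the random sample) to make this visible, and shows one possible way to compare only the rows that are actually present.

```python
import numpy as np
import pandas as pd

# Same index layout as the sample, but with deterministic values
index = pd.MultiIndex.from_arrays(
    [[1, 1, 1, 2, 2, 2, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1]],
    names=["ID", "INDEX"],
)
df = pd.DataFrame(
    np.arange(16, dtype=float).reshape(8, 2), index=index, columns=["Ts", "Tf"]
)

g_mean = df.groupby(level="INDEX").mean()
g_std = df.groupby(level="INDEX").std()

# Rows of ID 3 only; xs drops the ID level, leaving INDEX 0 and 1
x = df.xs(3, level="ID")

# Alignment adds INDEX 2 as an all-NaN row
diff = x - g_mean
within = diff.abs().le(3 * g_std)

# Variant: restrict the statistics to the rows present in x
within_present = (x - g_mean.loc[x.index]).abs().le(3 * g_std.loc[x.index])
complete_check = bool(within.to_numpy().all())
present_check = bool(within_present.to_numpy().all())
```

Whether an incomplete ID should count as an outlier depends on the use case; the variant above is only one option.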

#map the ID level to the boolean Series and filter by boolean indexing
df = df[df.index.get_level_values('ID').to_series().map(s).values]
#if necessary, remove unnecessary levels in MultiIndex
df.index = df.index.remove_unused_levels()
print (df)
                Ts        Tf
ID INDEX                    
1  0      0.612077  0.876833
   1      0.911303  0.377008
   2      0.326670  0.289647
2  0      0.525381  0.599262
   1      1.336077  1.177081
   2      1.322341  0.572035
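As an alternative sketch, the same criterion can be written without a custom function in `apply`, by broadcasting the per-INDEX statistics back to the original shape with `transform`. The function name `drop_outlier_ids`, its parameters, and the test data are assumptions made for illustration. Unlike the approach above, this variant does not reject an ID just because one of its INDEX rows is missing.

```python
import numpy as np
import pandas as pd

def drop_outlier_ids(df, id_level="ID", pos_level="INDEX", n_std=3.0):
    """Drop every ID that has at least one value outside n_std standard deviations."""
    grouped = df.groupby(level=pos_level)
    # transform returns frames with the same index as df
    mean = grouped.transform("mean")
    std = grouped.transform("std")
    # True for rows where every column is within the band
    # (a group with a single row has NaN std and is treated as failing)
    row_ok = ((df - mean).abs() <= n_std * std).all(axis=1)
    # An ID is kept only if all of its rows pass
    id_ok = row_ok.groupby(level=id_level).all()
    keep = df.index.get_level_values(id_level).isin(id_ok[id_ok].index)
    return df[keep]

# Deterministic test data: 20 IDs with 2 rows each, values between 0 and 6
index = pd.MultiIndex.from_product([range(1, 21), range(2)], names=["ID", "INDEX"])
values = (np.arange(80).reshape(40, 2) % 7).astype(float)
sample = pd.DataFrame(values, index=index, columns=["Ts", "Tf"])
sample.loc[(5, 1), "Ts"] = 1000.0  # inject one extreme value for ID 5

cleaned = drop_outlier_ids(sample)
```

With a sample standard deviation, a single value can deviate from the group mean by at most (n-1)/sqrt(n) standard deviations, so a 3-sigma rule can only flag anything when each INDEX group has at least 11 members. This is why the test data uses 20 IDs instead of 3.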
