
How to find outlier frames in a MultiIndex DataFrame

The result should be a MultiIndex DataFrame that does not contain any outliers. The criterion is the standard deviation: np.abs(x-g_mean) <= 3*g_std

I am trying to identify statistical outliers:

import pandas as pd
import numpy as np

#create sample
arrays = [[1,1,1,2,2,2,3,3],
          [0,1,2,0,1,2,0,1]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['ID', 'INDEX'])
df = pd.DataFrame(np.abs(np.random.randn(8, 2)), index=index, columns=['Ts','Tf'])

#groupby index and learn from data
g = df.groupby(level='INDEX')
g_mean=g.mean()
g_std = g.std()

#groupby ID and look if some ID is an outlier
g = df.groupby(level='ID')
test = g.apply(lambda x: True if np.abs(x-g_mean) <= 3*g_std else False)

The last line of my code does not work, because in the last groupby I am comparing DataFrames of two different shapes. Any suggestions?

You can use:

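#the level= argument of df.mean/df.std was removed in pandas 2.0, so group by the level instead
g_mean = df.groupby(level='INDEX').mean()
g_std = df.groupby(level='INDEX').std()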

def f(x):
    #remove first level per group
    x = x.reset_index(level=0, drop=True)
    #detect outliers and check whether all values are True
    m = (np.abs(x-g_mean) <= 3*g_std).values.all()
    return m

#groupby ID and look if some ID is an outlier
s = df.groupby(level='ID').apply(f)
print (s)
ID
1     True
2     True
3    False
dtype: bool
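
One thing worth noting about this output: in the sample data, ID 3 has no row for INDEX 2. After `x - g_mean` is aligned on INDEX, that missing row becomes NaN, and a comparison against NaN evaluates to False, so the ID appears to be flagged because of the missing row rather than because of an extreme value. A minimal sketch of that effect follows, with made-up numbers; the names `strict` and `lenient` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Hypothetical per-INDEX statistics (made-up values for illustration)
idx = pd.Index([0, 1, 2], name='INDEX')
g_mean = pd.DataFrame({'Ts': [1.0, 1.0, 1.0]}, index=idx)
g_std = pd.DataFrame({'Ts': [0.5, 0.5, 0.5]}, index=idx)

# One group that lacks the INDEX == 2 row, like ID 3 in the sample
x = pd.DataFrame({'Ts': [1.1, 0.9]}, index=pd.Index([0, 1], name='INDEX'))

# Alignment adds a NaN row for INDEX 2; NaN <= value is False
within = np.abs(x - g_mean) <= 3 * g_std

strict = within.values.all()                # missing rows count as failures
lenient = within.loc[x.index].values.all()  # only rows the group really has

print(strict, lenient)
```

Whether a missing row should disqualify an ID depends on the data; if it should not, restricting the check to the rows that are present, as in `lenient`, is one option.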

#map the ID level values by the boolean Series and filter by boolean indexing
df = df[df.index.get_level_values('ID').to_series().map(s).values]
#if necessary, remove unnecessary levels in MultiIndex
df.index = df.index.remove_unused_levels()
print (df)
                Ts        Tf
ID INDEX                    
1  0      0.612077  0.876833
   1      0.911303  0.377008
   2      0.326670  0.289647
2  0      0.525381  0.599262
   1      1.336077  1.177081
   2      1.322341  0.572035
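
As an alternative to `apply`, the same criterion can be sketched with `groupby(...).transform`, which returns statistics already aligned to the original index, so no shapes need to be reconciled. This is only a sketch under assumptions: the sample data below is hypothetical (20 IDs with one planted outlier), and the names `within`, `id_ok` and `clean` are illustrative. More IDs than in the question are used because, with the sample standard deviation, a value in a group of n observations can deviate from the group mean by at most (n-1)/sqrt(n) standard deviations, so a 3-sigma rule cannot trigger in groups of only three.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: 20 IDs x 3 INDEX positions, values close to 1.0
index = pd.MultiIndex.from_product(
    [np.arange(1, 21), [0, 1, 2]], names=['ID', 'INDEX'])
base = 1.0 + 0.01 * (index.get_level_values('ID').to_numpy() % 5)
df = pd.DataFrame({'Ts': base, 'Tf': base + 0.5}, index=index)

# Plant one outlier: ID 7 at INDEX 1
df.loc[(7, 1), 'Ts'] = 50.0

# Per-INDEX mean and std, broadcast back to the shape of df
grp = df.groupby(level='INDEX')
g_mean = grp.transform('mean')
g_std = grp.transform('std')

# Element-wise 3-sigma check, then require every value of an ID to pass
within = (df - g_mean).abs() <= 3 * g_std
id_ok = within.all(axis=1).groupby(level='ID').transform('all')

# Keep only the IDs that contain no outliers
clean = df[id_ok]
print(clean.index.get_level_values('ID').unique().tolist())
```

Unlike the `apply` version, this variant only checks rows that exist, so an ID with a missing INDEX row is not rejected for that reason alone.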

