How to find outlier frames in a MultiIndex DataFrame
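I am trying to identify statistical outliers in a MultiIndex DataFrame.
The criterion is the standard deviation: np.abs(x-g_mean) <= 3*g_std
The result should be a MultiIndex DataFrame that contains no outliers. My attempt so far: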
import pandas as pd
import numpy as np
# create sample
arrays = [[1,1,1,2,2,2,3,3],
          [0,1,2,0,1,2,0,1]]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['ID', 'INDEX'])
df = pd.DataFrame(np.abs(np.random.randn(8, 2)), index=index, columns=['Ts','Tf'])
#groupby index and learn from data
g = df.groupby(level='INDEX')
g_mean=g.mean()
g_std = g.std()
#groupby ID and look if some ID is an outlier
g = df.groupby(level='ID')
test = g.apply(lambda x: True if np.abs(x-g_mean) <= 3*g_std else False)
The last line of my code does not work because, in the last group, I am comparing DataFrames of two different shapes. Any suggestions?
You can use:
# mean and standard deviation per INDEX value
g_mean = df.groupby(level='INDEX').mean()
g_std = df.groupby(level='INDEX').std()
def f(x):
    # remove first level per group
    x = x.reset_index(level=0, drop=True)
    # detect outliers and check if all values are True
    m = (np.abs(x-g_mean) <= 3*g_std).values.all()
    return m
#groupby ID and look if some ID is an outlier
s = df.groupby(level='ID').apply(f)
print (s)
ID
1 True
2 True
3 False
dtype: bool
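Note that in this sample ID 3 has no row for INDEX 2. Since `x - g_mean` aligns on the index, the missing row becomes NaN, and a comparison against NaN is False, so an ID with missing rows is reported as False even when its observed values are within bounds. If that is not the behaviour you want, one possible variant is sketched below. The name `f_present_only` and the seeded random generator are my own assumptions for illustration, not part of the answer above; the idea is simply to restrict the statistics to the rows the group actually has.

```python
import numpy as np
import pandas as pd

# same layout as the sample in the question, but with a seeded generator
arrays = [[1, 1, 1, 2, 2, 2, 3, 3],
          [0, 1, 2, 0, 1, 2, 0, 1]]
index = pd.MultiIndex.from_arrays(arrays, names=['ID', 'INDEX'])
rng = np.random.default_rng(0)
df = pd.DataFrame(np.abs(rng.standard_normal((8, 2))),
                  index=index, columns=['Ts', 'Tf'])

g_mean = df.groupby(level='INDEX').mean()
g_std = df.groupby(level='INDEX').std()

def f_present_only(x):
    # hypothetical helper: drop the ID level so only INDEX remains
    x = x.reset_index(level=0, drop=True)
    # keep only the statistics for the INDEX values present in this group
    mean = g_mean.reindex(x.index)
    std = g_std.reindex(x.index)
    # True only if every observed value is within 3 standard deviations
    return bool((np.abs(x - mean) <= 3 * std).values.all())

s_present = df.groupby(level='ID').apply(f_present_only)
print(s_present)
```

With this variant an ID is judged only on the rows it has, so a missing INDEX value no longer marks it as an outlier by itself.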
# map the ID level to the boolean Series and filter by boolean indexing
df = df[df.index.get_level_values('ID').to_series().map(s).values]
#if necessary, remove unnecessary levels in MultiIndex
df.index = df.index.remove_unused_levels()
print (df)
                Ts        Tf
ID INDEX
1  0      0.612077  0.876833
   1      0.911303  0.377008
   2      0.326670  0.289647
2  0      0.525381  0.599262
   1      1.336077  1.177081
   2      1.322341  0.572035
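If you would rather avoid `apply`, the same check can probably be written in a vectorized way with `groupby(...).transform`, which broadcasts the per-INDEX statistics back onto the original rows. This is only a sketch under my own assumptions: the names `row_ok`, `id_ok` and `filtered` are mine, the random data is seeded for reproducibility, and, unlike the answer above, an ID with a missing INDEX row is not dropped for that reason alone.

```python
import numpy as np
import pandas as pd

# same layout as the sample in the question, but with a seeded generator
arrays = [[1, 1, 1, 2, 2, 2, 3, 3],
          [0, 1, 2, 0, 1, 2, 0, 1]]
index = pd.MultiIndex.from_arrays(arrays, names=['ID', 'INDEX'])
rng = np.random.default_rng(0)
df = pd.DataFrame(np.abs(rng.standard_normal((8, 2))),
                  index=index, columns=['Ts', 'Tf'])

# per-INDEX mean and std, broadcast to the shape of df
by_index = df.groupby(level='INDEX')
row_mean = by_index.transform('mean')
row_std = by_index.transform('std')

# a row is fine if every column is within 3 standard deviations
row_ok = ((df - row_mean).abs() <= 3 * row_std).all(axis=1)

# an ID is fine if all of its rows are fine
id_ok = row_ok.groupby(level='ID').all()

# keep only the rows that belong to a non-outlier ID
keep = df.index.get_level_values('ID').isin(id_ok[id_ok].index)
filtered = df[keep]
filtered.index = filtered.index.remove_unused_levels()
print(id_ok)
print(filtered)
```

One caveat about the sample itself: with only two or three values per INDEX group, a value can never be more than about 1.15 sample standard deviations from its group mean, so the 3-sigma rule cannot flag anything here. The criterion only becomes meaningful with more observations per group.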