Pandas replace by NaN if the difference with the previous row is above a threshold

I have a half-hourly DataFrame df from which I want to remove outliers.

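import numpy as np
import pandas as pd

date = ['2015-02-03 23:00:00', '2015-02-03 23:30:00', '2015-02-04 00:00:00', '2015-02-04 00:30:00']
value_column = [33.24, 500, 34.39, 34.49]

# Include the timestamps as an 'index' column so they can be parsed below
df = pd.DataFrame({'index': date, 'value column': value_column})
# The strings contain seconds, so the format needs %S as well
df.index = pd.to_datetime(df['index'], format='%Y-%m-%d %H:%M:%S')
df.drop(['index'], axis=1, inplace=True)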

print(df.head())
                     value column
index                            
2015-02-03 23:00:00         33.24
2015-02-03 23:30:00        500.00
2015-02-04 00:00:00         34.39
2015-02-04 00:30:00         34.49

I want to remove outliers based on the difference between the values from one row to the next. I would like to replace an outlier value by NaN if the absolute difference from one row to the next is above a given threshold. How can I do that efficiently?

I know that I can get the row-to-row differences of the DataFrame with the line below. However, I do not know how to replace values by NaN at the indexes where the difference is above the given threshold. Any idea on how to do that efficiently? (Assume, for instance, that the threshold is 100.)

df = df.diff()

I have tried the following; it does not throw any error, but it does not give the expected result:

df["value column"]=df["value column"].mask(df["value column"].diff().abs() > 100, np.nan) 

Expected results:

                     value column
index                            
2015-02-03 23:00:00         33.24
2015-02-03 23:30:00           NaN
2015-02-04 00:00:00         34.39
2015-02-04 00:30:00         34.49

You need to check the difference with both the previous row and the next row together; otherwise the third row (34.39) will be dropped as well, because it also differs from its predecessor (500) by more than the threshold.

df["value column"].mask((df["value column"].diff(-1).abs()>100) & (df["value column"].diff().abs() > 100), np.nan) 
Out[270]: 
0    33.24
1      NaN
2    34.39
3    34.49
Name: value column, dtype: float64
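
To make the difference between the one-sided and the two-sided check concrete, here is a minimal sketch on the question's data. The helper name `mask_isolated_spikes` and its `threshold` parameter are hypothetical, introduced only for illustration. Note that under this approach the first and last rows are never masked (their missing neighbour yields NaN, and NaN > threshold is False), and two consecutive outliers with similar values would not be flagged.

```python
import pandas as pd


def mask_isolated_spikes(series: pd.Series, threshold: float) -> pd.Series:
    """Hypothetical helper: set a value to NaN only if it differs from BOTH
    its previous and its next neighbour by more than `threshold`."""
    jump_from_previous = series.diff().abs() > threshold
    jump_to_next = series.diff(-1).abs() > threshold
    # mask() writes NaN wherever the condition is True
    return series.mask(jump_from_previous & jump_to_next)


index = pd.to_datetime([
    "2015-02-03 23:00:00", "2015-02-03 23:30:00",
    "2015-02-04 00:00:00", "2015-02-04 00:30:00",
])
values = pd.Series([33.24, 500, 34.39, 34.49], index=index, name="value column")

# One-sided check (the attempt from the question): flags the spike AND the row after it
one_sided = values.mask(values.diff().abs() > 100)
# Two-sided check: flags only the spike itself
two_sided = mask_isolated_spikes(values, threshold=100)

print(one_sided.isna().tolist())  # [False, True, True, False]
print(two_sided.isna().tolist())  # [False, True, False, False]
```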

One strategy would be to append the df.diff() values as a new column to your DataFrame and then use the df.apply() method on every row to return either the original value or NaN, depending on the value in the newly appended diff column. Keep in mind that df.diff() returns NaN for the first row, so you need to account for that explicitly in the function you pass to apply.

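# Signed difference with the previous row (NaN for the first row)
df['diff'] = df['value column'].diff()
# Keep the value unless it jumped UP by more than 100; the sign is kept, so downward jumps are not flagged
df['value column'] = df.apply(lambda x: x['value column'] if x['diff'] <= 100 or np.isnan(x['diff']) else np.nan, axis=1)
df = df.drop(columns='diff')
df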

Results:

                     value column
index                            
2015-02-03 23:00:00         33.24
2015-02-03 23:30:00           NaN
2015-02-04 00:00:00         34.39
2015-02-04 00:30:00         34.49
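
Since the question asks for efficiency, note that a row-wise apply can be slow on long series. The same rule can probably be expressed without apply; the following is a sketch under the same assumption as above (only upward jumps above the threshold are flagged), with `threshold`, `step` and `cleaned` as illustrative names.

```python
import pandas as pd

index = pd.to_datetime([
    "2015-02-03 23:00:00", "2015-02-03 23:30:00",
    "2015-02-04 00:00:00", "2015-02-04 00:30:00",
])
df = pd.DataFrame({"value column": [33.24, 500, 34.39, 34.49]}, index=index)

threshold = 100  # assumed value, as in the question
step = df["value column"].diff()  # signed change from the previous row; NaN for the first row

# where() keeps values where the condition is True and writes NaN elsewhere.
# NaN > threshold evaluates to False, so the first row is kept without special handling.
cleaned = df["value column"].where(~(step > threshold))
print(cleaned)
```

Switching to `step.abs() > threshold` here would also blank out the row that follows the spike, which is why the two-sided check is needed when the absolute difference is used.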
