
Outlier detection in time-series

I have a dataset in the following form:

      timestamp   consumption
2017-01-01 00:00:00 14.3
2017-01-01 01:00:00 29.1
2017-01-01 02:00:00 28.7
2017-01-01 03:00:00 21.3
2017-01-01 04:00:00 18.4
... ... ...
2017-12-31 19:00:00 53.2
2017-12-31 20:00:00 43.5
2017-12-31 21:00:00 37.1
2017-12-31 22:00:00 35.8
2017-12-31 23:00:00 33.8

I want to perform anomaly detection, in the sense of flagging abnormally high or low values.

I am using an isolation forest:

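from sklearn.ensemble import IsolationForest

# contamination=0.005 sets the threshold so that roughly 0.5% of the training points are flagged
IF = IsolationForest(random_state=0, contamination=0.005, n_estimators=200, max_samples=0.7)
IF.fit(model_data)

# New Outliers column: predict() returns -1 for outliers, so map -1 to 1 and everything else to 0.
# Assigning the array directly avoids index misalignment between data and a freshly built Series.
data['Outliers'] = (IF.predict(model_data) == -1).astype(int)

# Get anomaly score (lower means more abnormal; negative values are outliers)
score = IF.decision_function(model_data)

# New Anomaly Score column
data['Score'] = score
data.head()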

The points flagged as outliers are the following:

[Image: plot of the detected outliers]

It seems to identify the peaks, but it misses some low values that are clearly outliers; I have highlighted them in the plot.

Any idea what is causing this?

The values highlighted in yellow seem to repeat themselves, so the model encountered these values several times and won't consider them outliers. Since you're training and testing your model on the same dataset, this is not very surprising: the model is overfitting. Using a forest-based algorithm for a univariate time series seems like overkill to me. I would start with a simple algorithm that computes a rolling mean and standard deviation to find outliers before using anything more complex.
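The rolling mean and standard deviation approach could be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the function name `rolling_zscore_outliers`, the 24-hour window, and the threshold of 3 standard deviations are all assumptions chosen for the example, and the data is synthetic rather than the dataset from the question.

```python
import numpy as np
import pandas as pd


def rolling_zscore_outliers(series, window=24, threshold=3.0):
    """Flag points lying more than `threshold` rolling standard deviations
    away from the rolling mean of the preceding `window` points.

    Hypothetical helper for illustration; window and threshold need tuning.
    """
    # shift(1) keeps the current point out of its own baseline
    history = series.shift(1).rolling(window=window, min_periods=window)
    z = (series - history.mean()) / history.std()
    # Points without a full window of history have a NaN z-score,
    # and NaN > threshold is False, so they are never flagged
    return z.abs() > threshold


# Synthetic hourly series with a daily cycle (illustration only)
idx = pd.date_range("2017-01-01", periods=200, freq=pd.Timedelta(hours=1))
t = np.arange(200)
consumption = pd.Series(30 + 5 * np.sin(2 * np.pi * t / 24), index=idx)

# Inject one abnormally low and one abnormally high value
consumption.iloc[100] = 2.0
consumption.iloc[150] = 60.0

flags = rolling_zscore_outliers(consumption)
print(consumption[flags])
```

Because the baseline is computed only from the preceding window, a low value is judged against its recent neighbourhood rather than against the whole year, so it can be flagged even if similar values occur elsewhere in the series. One caveat of this simple version is that a flagged point inflates the rolling standard deviation for the next `window` points, which can mask a second outlier that follows closely.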


 