Detection of time-series outliers

I'm working on a university forecasting project. I have a large database of demand between pairs of cities. I know that this dataset is contaminated, but I do not know which data points are affected. It is a panel dataset that tracks demand between city pairs on a monthly basis. Below is part of the data that I am working with.

           CAI.JED CAI.RUH ADD.DXB CAI.IST ALG.IST
2013-01-01   19196   14777      16    1413      12
2013-02-01   19913       8   18203    1026       5
2013-03-01   34242   11751   17836     985       1
2013-04-01   23481   12000   13479     948      27
2013-05-01   24428   16046   16391     954       9
2013-06-01   31791   23479   16571       1       4
2013-07-01   33716   20090   11323       0    5724
2013-08-01   35553       2   11121       0       0
2013-09-01   18746   13423   12119       0      26
2013-10-01      10   12223   10239       0       0
2013-11-01      19   20234   14231       5       2
2013-12-01   15198       1   12132      10       5

The dataset is a combination of two datasets. The people who provided the data told me that in some months only one of the two datasets was working. However, it is not known which dataset was available in which months.

Now my question: for the next part of the project, I need annual demand numbers. Since I know that the figures are contaminated, I would like to remove outliers first. What techniques are available in R to do this?

As the data is in time-series format, I tried the tsoutliers package (see http://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf ). However, I could not get it working. I also tried the suggestions from https://stats.stackexchange.com/questions/104882/detecting-outliers-in-time-series-ls-ao-tc-using-tsoutliers-package-in-r-how/104946#104946 , but they didn't work either.

Once I know which points are outliers, I would like to either replace them (e.g. with the mean for that route) or, if too many points are missing, drop the entire route from the dataset.
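The replace-or-reject step described above could be sketched as follows. This is only an illustration in plain Python (the same logic ports directly to R), not a method taken from the question: the flagging rule (a value below 10% of the route median counts as an outlier), the 25% rejection limit, and the names `clean_route`, `low_fraction` and `max_outlier_share` are all assumptions. The rule only catches months where demand collapses toward zero, which is what a missing source dataset would look like; it will not catch spikes.

```python
from statistics import mean, median

def clean_route(values, low_fraction=0.1, max_outlier_share=0.25):
    """Flag suspiciously low months, then replace them or reject the route.

    Returns (cleaned_values, flags); cleaned_values is None when the
    route has too many flagged months and should be dropped.
    Both thresholds are illustrative assumptions, not tuned values.
    """
    med = median(values)
    # Assumed rule: a month far below the route median is an outlier.
    flags = [v < low_fraction * med for v in values]
    if sum(flags) > max_outlier_share * len(values):
        return None, flags  # too many bad months: reject the route
    # Replace each flagged month with the mean of the unflagged months.
    route_mean = mean([v for v, f in zip(values, flags) if not f])
    cleaned = [route_mean if f else v for v, f in zip(values, flags)]
    return cleaned, flags

# CAI.JED column from the sample data above (2013-01 .. 2013-12).
cai_jed = [19196, 19913, 34242, 23481, 24428, 31791,
           33716, 35553, 18746, 10, 19, 15198]
cleaned, flags = clean_route(cai_jed)
annual_total = sum(cleaned)  # annual demand after replacement
```

On the CAI.JED column this flags October and November (10 and 19) and fills them with the mean of the other ten months before summing to an annual figure. On the CAI.IST column the same call returns `None`, because four of the twelve months are flagged.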

I prefer a density-based clustering algorithm such as DBSCAN. If you tune the epsilon and num-samples parameters, you can filter outliers very precisely, using a plot to visualize the result (points labeled -1 are the outliers).
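To make the idea concrete, here is a minimal one-dimensional DBSCAN sketch in plain Python. The answer does not name a library; the -1 noise label matches the convention used by scikit-learn's `DBSCAN`, and `eps` and `min_samples` here are assumed to correspond to the "epsilon" and "num-samples" mentioned above. The function name `dbscan_1d` and the parameter values are assumptions chosen for the sample data, and plotting is left out.

```python
def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN on scalar values; returns one label per point.

    Labels 0, 1, ... are clusters; -1 marks noise (outliers).
    A point's neighborhood includes the point itself.
    """
    n = len(values)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        return [j for j in range(n) if abs(values[i] - values[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_samples:
            labels[i] = -1  # noise for now; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue  # already assigned, do not expand again
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_samples:
                queue.extend(nb_j)  # core point: keep growing the cluster
    return labels

# CAI.JED column from the question (2013-01 .. 2013-12).
cai_jed = [19196, 19913, 34242, 23481, 24428, 31791,
           33716, 35553, 18746, 10, 19, 15198]
labels = dbscan_1d(cai_jed, eps=5000, min_samples=3)
outlier_months = [i for i, lab in enumerate(labels) if lab == -1]
```

With these assumed settings the two near-zero months end up as noise and the rest fall into two clusters. The parameters matter a lot: with `min_samples=2` the values 10 and 19 would form a small cluster of their own instead of being labeled -1, so the settings need to be checked per route.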


 