Detection of time-series outliers

I'm working on a university forecasting project. I have a large database of demand between city pairs. I know that this dataset is contaminated, but I do not know which data points are affected. It is a panel dataset that tracks demand between city pairs on a monthly basis. Below is part of the data I am working with.

           CAI.JED CAI.RUH ADD.DXB CAI.IST ALG.IST
2013-01-01   19196   14777      16    1413      12
2013-02-01   19913       8   18203    1026       5
2013-03-01   34242   11751   17836     985       1
2013-04-01   23481   12000   13479     948      27
2013-05-01   24428   16046   16391     954       9
2013-06-01   31791   23479   16571       1       4
2013-07-01   33716   20090   11323       0    5724
2013-08-01   35553       2   11121       0       0
2013-09-01   18746   13423   12119       0      26
2013-10-01      10   12223   10239       0       0
2013-11-01      19   20234   14231       5       2
2013-12-01   15198       1   12132      10       5

The dataset is a combination of two source datasets. The people who provided the data told me that in some months only one of the two sources was working. However, it is not known which source was available in which months.

Now comes my question: for the next part of the project, I need annual demand numbers. Since I know that some of the figures are corrupted, I would like to remove outliers first. What techniques are available in R to do this?

As the data is in time-series format, I tried to use the tsoutliers package (see http://cran.r-project.org/web/packages/tsoutliers/tsoutliers.pdf ). However, I could not get it working. I also tried the suggestions from https://stats.stackexchange.com/questions/104882/detecting-outliers-in-time-series-ls-ao-tc-using-tsoutliers-package-in-r-how/104946#104946 , but that didn't work either.

Once I know which points are outliers, I would like to either replace them (e.g. with the mean for that route) or, if too many points on a route are affected, reject the entire route from the dataset.
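A minimal sketch of that replace-or-reject step, written in Python rather than R and using only the standard library. It is one possible rule, not an established method for this dataset: the function names, the log transform, the modified z-score cutoff of 3.5 and the `max_fraction` limit are all assumptions made for illustration. Values are scored on a log scale with the median and the median absolute deviation (MAD), because the suspicious months differ from the rest by orders of magnitude.

```python
import math
from statistics import median


def flag_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score on a log scale exceeds `cutoff`."""
    logs = [math.log1p(v) for v in values]  # log1p copes with zero demand
    med = median(logs)
    mad = median([abs(x - med) for x in logs])
    if mad == 0:
        # More than half the values are identical, so the score is undefined.
        return [False] * len(values)
    # 0.6745 rescales the MAD so the score is comparable to a normal z-score.
    return [0.6745 * abs(x - med) / mad > cutoff for x in logs]


def clean_route(values, cutoff=3.5, max_fraction=0.25):
    """Replace flagged months with the mean of the rest, or return None
    when more than `max_fraction` of the months are flagged."""
    flags = flag_outliers(values, cutoff)
    if sum(flags) > max_fraction * len(values):
        return None  # too many suspect months: reject the route
    kept = [v for v, f in zip(values, flags) if not f]
    route_mean = sum(kept) / len(kept)
    return [route_mean if f else v for v, f in zip(values, flags)]


# CAI.JED and CAI.RUH, January to December 2013, copied from the table above.
cai_jed = [19196, 19913, 34242, 23481, 24428, 31791,
           33716, 35553, 18746, 10, 19, 15198]
cai_ruh = [14777, 8, 11751, 12000, 16046, 23479,
           20090, 2, 13423, 12223, 20234, 1]

cleaned = clean_route(cai_jed)
print(flag_outliers(cai_jed))                  # per-month outlier flags
print(sum(cleaned))                            # annual demand after replacement
print(clean_route(cai_ruh, max_fraction=0.2))  # None means the route is rejected
```

On the CAI.JED column this flags the October and November values (10 and 19) and replaces them with the mean of the other ten months. A median-based rule like this only makes sense while fewer than half of a route's months are contaminated; a route such as CAI.IST, where most months are near zero, would probably need a separate check before being rejected.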

I prefer a density-based clustering algorithm such as DBSCAN. By tuning epsilon (the neighbourhood radius) and the minimum number of samples, you can filter outliers quite precisely, using a plot to visualize the result (points labelled -1 are the outliers).
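To make the DBSCAN suggestion concrete, here is a small self-contained sketch in Python. It is a hand-written one-dimensional DBSCAN for illustration only; in practice a library implementation would normally be used (scikit-learn's `DBSCAN`, for example, uses the same -1 noise label). The function name `dbscan_1d`, the log scaling, and the parameter values `eps=0.5` and `min_samples=3` are assumptions chosen for this example, not settings taken from the answer.

```python
import math

NOISE = -1


def dbscan_1d(values, eps, min_samples):
    """Minimal DBSCAN on scalars; returns one label per value, -1 for noise."""
    n = len(values)
    # Each neighbourhood includes the point itself.
    neighbours = [
        [j for j in range(n) if abs(values[i] - values[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = NOISE  # may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is a core point, keep expanding
    return labels


# CAI.JED, January to December 2013, copied from the question's table.
cai_jed = [19196, 19913, 34242, 23481, 24428, 31791,
           33716, 35553, 18746, 10, 19, 15198]

# Cluster on a log scale so that eps is a ratio rather than an absolute gap.
scaled = [math.log1p(v) for v in cai_jed]
labels = dbscan_1d(scaled, eps=0.5, min_samples=3)
outliers = [v for v, lab in zip(cai_jed, labels) if lab == NOISE]
print(labels)
print(outliers)
```

With these settings the ten plausible months form a single cluster and the two near-zero months come out as noise. Both parameters need tuning per route, which is where plotting the labelled points helps.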
