
R - find outliers in time series dataset using standard deviation

I have an xts time series object with numeric values for the data. Output of str(dataTS):

An 'xts' object on 2014-02-14 14:27:00/2014-02-28 14:22:00 containing:
  Data: num [1:4032, 1] 51.8 44.5 41.2 48.6 46.7 ...
  Indexed by objects of class: [POSIXlt,POSIXt] TZ:
  xts Attributes:
 NULL

I want to find the data points that are more than 2 * sd away from the mean, and I would like to create a new dataset from them.

A sample of the time series:

                        [,1]
2015-02-14 14:27:00   51.846
2015-02-14 14:32:00   44.508
2016-02-14 14:37:00   41.244
2015-02-14 14:42:00   48.568
2015-02-14 14:47:00   46.714
2015-02-14 14:52:00   44.986
2015-02-14 14:57:00   49.108
2015-02-14 15:02:00 1000.470
2015-02-14 15:07:00   53.404
2015-02-14 15:12:00   45.400
2015-02-14 15:17:00    3.216
2015-02-14 15:22:00   49.7204

In this sample, I want to subset the outliers 3.216 and 1000.470.

You can scale your data to have zero mean and unit standard deviation. Each scaled value is then a z-score, so you can directly identify individual observations that are 2 or more standard deviations away from the mean.
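To make the scaling step concrete, here is a minimal base R sketch. The vector x is made-up illustration data (not taken from the question); it only shows that scale() gives the same numbers as subtracting the mean and dividing by the sample standard deviation.

```r
# Made-up illustration data (an assumption, not from the question)
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

# z-score computed by hand: centre on the mean, divide by the sample sd
z_manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix; as.numeric() drops it to a vector
z_scale <- as.numeric(scale(x))

# TRUE if both approaches agree up to floating-point tolerance
all.equal(z_manual, z_scale)

# Observations at least 2 standard deviations from the mean, either side
x[abs(z_scale) >= 2]
```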

As an example, I randomly sample some data from a Cauchy distribution.

set.seed(2010)
smpl <- rcauchy(10, location = 4, scale = 3)

To illustrate, I store the sample data and the scaled sample data in a data.frame; I also mark observations that are 2 or more standard deviations away from the mean in either direction.

library(tidyverse)

df <- data.frame(Data = smpl) %>%
    mutate(
        # z-score: (Data - mean(Data)) / sd(Data)
        Data.scaled = as.numeric(scale(Data)),
        # abs() flags deviations on both sides of the mean
        deviation_greater_than_2sd = abs(Data.scaled) >= 2)
df
#         Data Data.scaled deviation_greater_than_2sd
#1    8.007951  -0.2639689                      FALSE
#2  -34.072054  -0.5491882                      FALSE
#3  465.099800   2.8342104                       TRUE
#4    7.191778  -0.2695010                      FALSE
#5    2.383882  -0.3020890                      FALSE
#6    3.544079  -0.2942252                      FALSE
#7   -7.002769  -0.3657119                      FALSE
#8    4.384503  -0.2885287                      FALSE
#9   15.722492  -0.2116796                      FALSE
#10   4.268082  -0.2893179                      FALSE

We can also visualise the distribution of Data.scaled:

ggplot(df, aes(Data.scaled)) + geom_histogram()


The "outlier" (row 3, with a value of about 465.1) is about 2.8 standard deviations above the mean.
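Applied back to the question, the same idea could look roughly like the sketch below. It assumes the xts package is installed, rebuilds a small dataTS from the twelve values shown in the question, and generates evenly spaced 5-minute timestamps in UTC (an assumption, because the years in the question's listing are inconsistent). One caveat: on this small sample the mean and standard deviation are themselves pulled by 1000.470, so the 2-sd rule flags only that value and misses 3.216. The second part swaps in a different technique, a median/MAD-based score, which is not part of the answer above but does flag both points on this sample.

```r
library(xts)

# Values copied from the question's sample
values <- c(51.846, 44.508, 41.244, 48.568, 46.714, 44.986,
            49.108, 1000.470, 53.404, 45.400, 3.216, 49.7204)

# Assumed timestamps: 5-minute steps from 2015-02-14 14:27:00, UTC
times <- seq(as.POSIXct("2015-02-14 14:27:00", tz = "UTC"),
             by = "5 min", length.out = length(values))
dataTS <- xts(values, order.by = times)

x <- as.numeric(coredata(dataTS))

# Rule from the answer: at least 2 standard deviations from the mean
z <- (x - mean(x)) / sd(x)
outliers_sd <- dataTS[which(abs(z) >= 2), ]

# Alternative (not from the answer): centre on the median and scale by
# the MAD, which are far less affected by the extreme values themselves
robust_z <- (x - median(x)) / mad(x)
outliers_mad <- dataTS[which(abs(robust_z) >= 2), ]

outliers_sd   # new xts object holding only the 1000.470 row
outliers_mad  # new xts object holding the 3.216 and 1000.470 rows
```

On the full 4032-point series the two rules may agree more closely, since a couple of extreme values distort the mean and standard deviation less in a larger sample; the threshold of 2 is the one from the question and can be adjusted.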
