R: How to find longest periods with overlapping data points and no missing data?

I have a very large time series dataset of electricity load from a substation. It has been cleaned to have consistent time intervals of 15 minutes, but there are still large periods of missing data. The substation is split into individual feeders, so the data is in the form:

Feeder <- c("F1","F1","F1","F1","F1", "F2","F2","F2","F2","F2", "F3","F3","F3","F3","F3")
Load <- c(3.1, NA, 4.0, 3.8, 3.6, 2.1, NA, 2.6, 2.9, 3.0, 2.4, NA, 2.3, 2.2, 2.5)

start <- as.POSIXct("2016-01-12 23:15:00")
end <- as.POSIXct("2016-01-13 00:15:00")
DateTimeseq <- seq(start, end, by = "15 min")
DateTime <- c(DateTimeseq, DateTimeseq, DateTimeseq)

dt <- data.frame(Feeder, Load, DateTime)

My actual data spans multiple years, but I have condensed it so it is easily reproducible. As you can see, there are missing values, and my actual dataset has large periods of missing data. In order to perform effective analysis, I need to find periods where there are no missing load data points for any feeder (i.e. the longest overlapping periods). If possible, I would like to generate a list of the longest overlapping periods without any NA values, with a minimum length of around 24 hours. I know this is not possible for the example I give, but it would be great if you could show me how. You could use a minimum of 15 minutes or something similar in this example.

As you can see from the simple data, the longest period would be the 30 minutes between 2016-01-12 23:45:00 and 2016-01-13 00:15:00. However, in this example the second longest period would be 15 minutes but lies inside the longest period. If possible, I would like the result not to repeat values in this way. In that case, the second longest period would be the single overlapping point at 2016-01-12 23:15:00.

Feel free to play around with it and add more values if that makes it easier. It may be beneficial to create individual columns for the different feeders. I usually use pipes from dplyr, but this is not essential. If you require any more information, do not hesitate to ask.

Thanks!

Perhaps this will give you a start. For each Feeder you can create groups between NA values, calculate their first and last timestamp, and create a 15-minute sequence between them. You can then count which timestamps occur the most in the data.

library(dplyr)

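dt %>%
  group_by(Feeder) %>%
  # Start a new group at every NA (use add = TRUE instead of .add = TRUE in dplyr < 1.0.0)
  group_by(grp = cumsum(is.na(Load)), .add = TRUE) %>%
  # Drop the NA row that opens each group so it is not counted as available
  filter(!is.na(Load)) %>%
  # First and last timestamp of each NA-free stretch
  summarise(start = first(DateTime),
            end = last(DateTime)) %>%
  ungroup() %>%
  # Expand each stretch back into its 15-minute timestamps
  mutate(datetime = purrr::map2(start, end, seq, by = '15 mins')) %>%
  tidyr::unnest(datetime) %>%
  select(-start, -end) %>%
  # Number of feeders with data at each timestamp
  count(datetime, sort = TRUE)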
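
The counts above tell you how many feeders have data at each timestamp, but not yet which consecutive timestamps form a period. One possible way to get there is sketched below; it is not part of the original answer. It assumes, as stated in the question, that every 15-minute timestamp is present as a row for every feeder. The names `n_feeders`, `complete`, `run` and `periods` are made up for illustration, and the time zone is fixed to UTC only to keep the example reproducible.

```r
library(dplyr)

# Example data from the question
Feeder <- rep(c("F1", "F2", "F3"), each = 5)
Load <- c(3.1, NA, 4.0, 3.8, 3.6, 2.1, NA, 2.6, 2.9, 3.0, 2.4, NA, 2.3, 2.2, 2.5)
DateTime <- rep(seq(as.POSIXct("2016-01-12 23:15:00", tz = "UTC"),
                    as.POSIXct("2016-01-13 00:15:00", tz = "UTC"),
                    by = "15 min"), times = 3)
dt <- data.frame(Feeder, Load, DateTime)

n_feeders <- n_distinct(dt$Feeder)

periods <- dt %>%
  group_by(DateTime) %>%
  # A timestamp is complete when every feeder has a non-NA load
  summarise(complete = sum(!is.na(Load)) == n_feeders, .groups = "drop") %>%
  arrange(DateTime) %>%
  # A new run starts whenever the complete flag changes
  mutate(run = cumsum(complete != lag(complete, default = first(complete)))) %>%
  filter(complete) %>%
  group_by(run) %>%
  summarise(start = min(DateTime), end = max(DateTime),
            n_points = n(), .groups = "drop") %>%
  arrange(desc(n_points))

periods
```

Because each timestamp belongs to exactly one run, the resulting periods do not overlap, which matches the "no repeated values" requirement in the question.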

Base R solution:

# Strategy 1 contiguous period classification:
data.frame(do.call("rbind", lapply(split(dt, dt$Feeder), function(x){
    y <- with(x, x[order(DateTime),])
    y$category <- paste0(y$Feeder, ":", cumsum(is.na(y$Load)) + 1)
    tmp <- y[!(is.na(y$Load)),]
    cat_diff <- do.call("rbind", lapply(split(tmp, tmp$category), 
                function(z){
                  data.frame(category = unique(z$category), 
                    max_diff = difftime(max(z$DateTime),
                                        min(z$DateTime), 
                                        units = "hours"))}))
    y$max_diff <- cat_diff$max_diff[match(y$category, cat_diff$category)] 
    return(y)
      }
    )
  ), row.names = NULL
)

Here is another option: cast into a wide table and check for consecutive rows without any NAs:

library(data.table)

wDT <- dcast(setDT(dt)[, na := +is.na(Load)], DateTime ~ Feeder, value.var="na")

wDT[, c("ri", "rr") := {
    ri <- rleid(rowSums(.SD)==0L)
    .(ri, rowid(ri))
}, .SDcols=names(wDT)[-1L]]
range(wDT[ri %in% ri[rr==max(rr)]]$DateTime)
#[1] "2016-01-12 23:45:00 +08" "2016-01-13 00:15:00 +08"
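
Note that `max(rr)` above is taken over all runs, so on real data the longest run could in principle be one where data is missing. A hedged variant that only looks at complete rows and lists every complete period, longest first, might look like the sketch below. It is an illustration rather than the answerer's code; `complete`, `run`, `runs` and `min_points` are invented names, and `min_points` is an assumed threshold.

```r
library(data.table)

# Example data from the question
DT <- data.table(
  Feeder = rep(c("F1", "F2", "F3"), each = 5),
  Load = c(3.1, NA, 4.0, 3.8, 3.6, 2.1, NA, 2.6, 2.9, 3.0, 2.4, NA, 2.3, 2.2, 2.5),
  DateTime = rep(seq(as.POSIXct("2016-01-12 23:15:00", tz = "UTC"),
                     as.POSIXct("2016-01-13 00:15:00", tz = "UTC"),
                     by = "15 min"), times = 3)
)

# One row per timestamp, one column per feeder
wide <- dcast(DT, DateTime ~ Feeder, value.var = "Load")
feeders <- setdiff(names(wide), "DateTime")

# Flag rows where every feeder has data, then label consecutive runs
wide[, complete := complete.cases(.SD), .SDcols = feeders]
wide[, run := rleid(complete)]

# Summarise only the complete runs, longest first
runs <- wide[complete == TRUE,
             .(start = min(DateTime), end = max(DateTime), n_points = .N),
             by = run][order(-n_points)]

# Assumed threshold: 1 point here; about 96 points for roughly 24 h of 15-minute data
min_points <- 1
runs[n_points >= min_points]
```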

I might have a nice three-line solution for you:

  1. First bring the data into wide format, so that each Feeder is a column.
  2. Check row-wise (which is now timestamp-wise) whether any Feeder is NA. This gives something like 12:15 TRUE, 12:30 TRUE, 12:45 FALSE, ... FALSE in this context means all Feeders are available for this timestamp.
  3. Do a run length encoding on the resulting TRUE, TRUE, FALSE, FALSE, ... series. This lets you find what you call consecutive overlapping periods.

Code:

library("tidyr")
library("dplyr")

# Into wide format: one row per timestamp, one column per Feeder
dt_wide <- dt %>% pivot_wider(names_from = Feeder, values_from = Load)

# TRUE where at least one Feeder is NA at that timestamp
dt_anyna <- apply(dt_wide, 1, anyNA)

# Now we need to find the longest FALSE runs
rle(dt_anyna)

This gives you a run length encoding that looks like the following:

  Run Length Encoding
  lengths: int [1:3] 1 1 3
  values : logi [1:3] FALSE TRUE FALSE

This means that at the beginning you have 1 FALSE in a row, then 1 TRUE in a row, then 3 FALSE in a row.

You can now easily work with these results. You probably want to filter out the TRUE runs, because you are only looking for the longest runs where all data is available (these are the FALSE runs). Then you can look for the max() run, and you can also look for e.g. runs > 4 (which would be about 1 h for your 15-minute data).
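
As a rough base R illustration of that filtering step, the sketch below keeps only the FALSE runs, picks the longest one, and applies a minimum length. The vector `any_na` is a hypothetical stand-in for the result of the `anyNA` check, and `min_points` is an assumed threshold, not a value from the original answer.

```r
# Hypothetical flag vector: TRUE = at least one feeder is NA at that timestamp
any_na <- c(FALSE, TRUE, FALSE, FALSE, FALSE)

# Assumed threshold; about 96 points would cover roughly 24 h of 15-minute data
min_points <- 2

r <- rle(any_na)
ends <- cumsum(r$lengths)          # index of the last element of each run
starts <- ends - r$lengths + 1     # index of the first element of each run

# Keep only the runs where all data is available (value FALSE)
complete_runs <- data.frame(start_idx = starts,
                            end_idx = ends,
                            n_points = r$lengths)[!r$values, ]

# Longest complete run, and all complete runs meeting the threshold
longest <- complete_runs[which.max(complete_runs$n_points), ]
long_enough <- complete_runs[complete_runs$n_points >= min_points, ]

longest
long_enough
```

The indices can then be used to look up the matching timestamps in the wide table.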

Additional code for the question from Ellis:

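# Store the encoding under a name that does not shadow the rle() function
runs <- rle(dt_anyna)
x <- data.frame(value = runs$values, duration = runs$lengths)
# First and last timestamp of each run
x$start <- dt_wide$DateTime[(cumsum(x$duration) - x$duration) + 1]
x$end <- dt_wide$DateTime[cumsum(x$duration)]
x$duration_s <- x$end - x$start
# Longest runs first, keeping only the runs without any NA
ordered <- x[order(x$duration, decreasing = TRUE), ]
filtered <- filter(ordered, value == FALSE)
filtered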

So, picking up where we left off, you can use this code to add start and end times and durations, and to sort and filter. (You must now also call library("dplyr") at the beginning.)

The result would look like this:

value  duration   start                end                 duration_s
FALSE        3    2016-01-12 23:45:00 2016-01-13 00:15:00  1800 secs
FALSE        1    2016-01-12 23:15:00 2016-01-12 23:15:00     0 secs

This gives you a data.frame of consecutive non-NA segments with their start and end times, ordered by duration.
