Find start and end idx of a time series by group in a data.table

I have a data.table that looks like this:

library(data.table)

data <- data.table(time = c(0, 1, 2, 3, 4, 5, 6, 7),
                   anom = c(0, 0, 1, 1, 1, 0, 0, 0),
                   gier = c(0, 0, 4, 9, 7, 0, 0, 0))

Now I am calculating some statistical values of the column gier, grouped by the column anom, like this:

cols <- c("gier")
statFun <- function(x) list(mean = mean(x), median = median(x), std = sd(x))
statSum <- data[, unlist(lapply(.SD, statFun), recursive = FALSE), .SDcols = cols, by = anom]

This works, but I want to go a step further and add the start and end points of time for each run of the anom groups (0 and 1). In the end I want something like a new time series that contains only the start and end points of time. The result should look like this:

res <- data.table(x.start     = c(0, 2, 5),
                  x.end       = c(1, 4, 7),
                  anom        = c(0, 1, 0),
                  gier.mean   = c(0, 6.666, 0),
                  gier.median = c(0, 7, 0),
                  gier.std    = c(0, 2.516, 0))

How is it possible to achieve this?

Addition: is there a way to get this result for multiple columns and not only for a single column like gier? For example, I am able to do the following, but I don't know how to extend it with the start and end columns mentioned above. This way there is at least an extra column rn holding the names of the columns for which I calculate the statistical values.

res <- data[, setDT(do.call(rbind.data.frame, lapply(.SD, statFun)), keep.rownames = TRUE), .SDcols = cols, by = anom]

You can include additional calculations outside lapply:

library(data.table)

# group by runs of consecutive equal anom values, not by anom itself
data[, unlist(c(lapply(.SD, statFun),
                anom = first(anom), x.start = first(time), x.end = last(time)),
              recursive = FALSE),
     by = rleid(anom), .SDcols = cols]

#   rleid gier.mean gier.median gier.std anom x.start x.end
#1:     1  0.000000           0 0.000000    0       0     1
#2:     2  6.666667           7 2.516611    1       2     4
#3:     3  0.000000           0 0.000000    0       5     7
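
Grouping by rleid(anom) rather than by anom is what keeps the two separate runs of anom == 0 apart. As a rough sketch of how the same call might extend to several columns (the addition in the question), the example below adds a made-up column speed, which is not part of the original data and is only there for illustration, and lists it in cols:

```r
library(data.table)

# Question's data plus a hypothetical extra column `speed` (illustration only)
data <- data.table(time  = c(0, 1, 2, 3, 4, 5, 6, 7),
                   anom  = c(0, 0, 1, 1, 1, 0, 0, 0),
                   gier  = c(0, 0, 4, 9, 7, 0, 0, 0),
                   speed = c(1, 1, 5, 6, 7, 2, 2, 2))

# rleid() gives every run of identical consecutive values its own id,
# so the second block of zeros is not merged with the first one
run_id <- rleid(data$anom)
run_id
# [1] 1 1 2 2 2 3 3 3

statFun <- function(x) list(mean = mean(x), median = median(x), std = sd(x))
cols <- c("gier", "speed")

# Same pattern as above; only `cols` changed. `grp` is just a name for the run id.
res <- data[, unlist(c(lapply(.SD, statFun),
                       anom = first(anom), x.start = first(time), x.end = last(time)),
                     recursive = FALSE),
            by = .(grp = rleid(anom)), .SDcols = cols]
names(res)
```

Each column named in cols should contribute its own .mean, .median and .std columns, while anom, x.start and x.end appear once per run.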

In dplyr we can do this similarly:

library(data.table)  # for rleid()
library(dplyr)

data %>%
  group_by(grp = rleid(anom)) %>%
  summarise(across(all_of(cols), list(mean = mean, median = median, std = sd)),
            anom = first(anom),
            x.start = first(time),
            x.end = last(time))
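
By default across() names its output columns gier_mean, gier_median and so on. If the exact layout from the question is wanted (x.start, x.end, anom, gier.mean, ...), one possible sketch uses the .names argument of across() and drops the helper grouping column afterwards. It assumes dplyr 1.0 or later and calls rleid() through data.table:: so that data.table does not have to be attached:

```r
library(dplyr)

data <- data.frame(time = c(0, 1, 2, 3, 4, 5, 6, 7),
                   anom = c(0, 0, 1, 1, 1, 0, 0, 0),
                   gier = c(0, 0, 4, 9, 7, 0, 0, 0))
cols <- c("gier")

res <- data %>%
  # one group per run of consecutive equal anom values
  group_by(grp = data.table::rleid(anom)) %>%
  summarise(x.start = first(time),
            x.end   = last(time),
            anom    = first(anom),
            # "{.col}.{.fn}" yields gier.mean, gier.median, gier.std
            across(all_of(cols),
                   list(mean = mean, median = median, std = sd),
                   .names = "{.col}.{.fn}"),
            .groups = "drop") %>%
  select(-grp)  # the run id is only needed for grouping

res
```

Adding more names to cols should produce the corresponding extra statistic columns without any other change.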
