
Calculating stretches of consecutive values

I have a df with two columns of interest: Date and Quality. Date is a daily time series. Quality takes one of three values (Good, Estimated, or Missing), with exactly one value associated with each date.

I would like to retrieve two pieces of information: (1) the lengths of the consecutive stretches (runs) of a given value over the time series; and (2) the dates associated with those runs.

For example:

1900-01-01  Good
1900-01-02  Good
1900-01-03  Good
1900-01-04  Estimated
1900-01-05  Good
1900-01-06  Good
1900-01-07  Estimated
1900-01-08  Good

Here, for Good, the run lengths would be 3, 2, 1, and I would like to return the matching date ranges: 1900-01-01 to 1900-01-03, 1900-01-05 to 1900-01-06, and 1900-01-08.

You can use rle (run-length encoding).

The code below returns the lengths of the consecutive runs of Good:

encodes <- rle(df$Quality)
encodes$lengths[encodes$values == "Good"]
# [1] 3 2 1

The dates for each run can then be looked up directly in df.
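One way to do that lookup, as a minimal base R sketch: the cumulative sum of the run lengths gives the row where each run ends, and subtracting the length gives the row where it starts. This assumes df is already sorted by Date; the names starts, ends, runs and good_runs are illustrative and not part of the original answer.

```r
# Self-contained copy of the example data
df <- data.frame(
  Date = as.Date("1900-01-01") + 0:7,
  Quality = c("Good", "Good", "Good", "Estimated",
              "Good", "Good", "Estimated", "Good"),
  stringsAsFactors = FALSE
)

encodes <- rle(df$Quality)
ends    <- cumsum(encodes$lengths)      # row index where each run ends
starts  <- ends - encodes$lengths + 1   # row index where each run starts

# One row per run: value, first date, last date, run length
runs <- data.frame(
  Quality = encodes$values,
  start   = df$Date[starts],
  end     = df$Date[ends],
  n       = encodes$lengths
)

good_runs <- runs[runs$Quality == "Good", ]
good_runs
```

For the example data, good_runs holds three rows with run lengths 3, 2 and 1, matching the rle output above.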

Data:

df <- read.table(text = "Date Quality
1900-01-01  Good
1900-01-02  Good
1900-01-03  Good
1900-01-04  Estimated
1900-01-05  Good
1900-01-06  Good
1900-01-07  Estimated
1900-01-08  Good", header = TRUE, stringsAsFactors = FALSE)
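library(data.table)
setDT(df)

# rleid() gives each run of identical Quality values its own id;
# per run, keep the first date, the last date, and the row count (.N).
out <- 
  df[order(Date), .(start = Date[1], end = Date[.N], .N), 
     by = .(Quality, id = rleid(Quality))][, -'id']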

out[Quality == 'Good']
#    Quality      start        end N
# 1:    Good 1900-01-01 1900-01-03 3
# 2:    Good 1900-01-05 1900-01-06 2
# 3:    Good 1900-01-08 1900-01-08 1
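The grouping above depends on each run getting its own id. As a rough illustration of what that id looks like, the sketch below builds the same numbering in base R from rle(); for this vector it should match what rleid(Quality) returns. The names quality, runs and ids are illustrative only.

```r
quality <- c("Good", "Good", "Good", "Estimated",
             "Good", "Good", "Estimated", "Good")

# Number the runs found by rle(), then repeat each number
# once per element of its run.
runs <- rle(quality)
ids  <- rep(seq_along(runs$lengths), runs$lengths)
ids
# [1] 1 1 1 2 3 3 4 5
```

Because the second and third stretches of Good get ids 3 and 5 rather than sharing one id, grouping by (Quality, id) keeps them as separate runs.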

Data used:

df <- fread('
Date  Quality
1900-01-01  Good
1900-01-02  Good
1900-01-03  Good
1900-01-04  Estimated
1900-01-05  Good
1900-01-06  Good
1900-01-07  Estimated
1900-01-08  Good
')

df[, Date := as.Date(Date)]

One dplyr possibility could be:

library(dplyr)

# Here V1 is the date column and V2 the quality column.
df %>%
 mutate(rleid = with(rle(V2), rep(seq_along(lengths), lengths)),
        V1 = as.Date(V1, format = "%Y-%m-%d")) %>%
 group_by(rleid, V2) %>%
 summarise(res = paste0(min(V1), ":", max(V1)))

  rleid V2        res                  
  <int> <chr>     <chr>                
1     1 Good      1900-01-01:1900-01-03
2     2 Estimated 1900-01-04:1900-01-04
3     3 Good      1900-01-05:1900-01-06
4     4 Estimated 1900-01-07:1900-01-07
5     5 Good      1900-01-08:1900-01-08

Or, numbering the runs within each quality value:

df %>%
 mutate(rleid = with(rle(V2), rep(seq_along(lengths), lengths)),
        V1 = as.Date(V1, format = "%Y-%m-%d")) %>%
 group_by(rleid, V2) %>%
 summarise(res = paste0(min(V1), ":", max(V1))) %>%
 group_by(V2) %>%
 mutate(rleid = seq_along(rleid)) %>%
 arrange(V2, rleid)

  rleid V2        res                  
  <int> <chr>     <chr>                
1     1 Estimated 1900-01-04:1900-01-04
2     2 Estimated 1900-01-07:1900-01-07
3     1 Good      1900-01-01:1900-01-03
4     2 Good      1900-01-05:1900-01-06
5     3 Good      1900-01-08:1900-01-08

Or, also reporting the length of each run:

df %>%
 mutate(rleid = with(rle(V2), rep(seq_along(lengths), lengths)),
        V1 = as.Date(V1, format = "%Y-%m-%d")) %>%
 group_by(rleid, V2) %>%
 summarise(res = paste0(min(V1), ":", max(V1)),
           n = n()) %>%
 group_by(V2) %>%
 mutate(rleid = seq_along(rleid)) %>%
 arrange(V2, rleid)

  rleid V2        res                       n
  <int> <chr>     <chr>                 <int>
1     1 Estimated 1900-01-04:1900-01-04     1
2     2 Estimated 1900-01-07:1900-01-07     1
3     1 Good      1900-01-01:1900-01-03     3
4     2 Good      1900-01-05:1900-01-06     2
5     3 Good      1900-01-08:1900-01-08     1
