Identify start date, end date, length of run of consecutive number, and transpose into new data frame

I have a set of data that looks like this:

          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1

From this, I am trying to create a new data frame that shows the dates on which my runs of 1s occur and how long those runs are, with the column headers: 1) start date, 2) end date, and 3) length of run.

Ultimately, I want to create a data frame that looks like this from the data above:

  Start Date   End Date  Length of Run
1 2006-06-01 2006-06-03              3
2 2006-06-06 2006-06-07              2  

My data also contains a few NAs, which I need to ignore.

You could do this with dplyr: use mutate to convert missing boolean values to 0, group_by to form groups of consecutive rows that share the same boolean value, filter to keep only groups where boolean is 1 and the group has more than one member, and summarize to extract the start date, end date, and run length. (I take a few extra steps to remove the grouping variable at the end.)

library(dplyr)
dat %>%
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
  filter(boolean == 1 & n() > 1) %>%
  summarize("Start Date"=min(as.character(Date)),
            "End Date"=max(as.character(Date)),
            "Length of Run"=n()) %>%
  ungroup() %>%
  select(-matches("group"))
#   Start Date   End Date Length of Run
#        (chr)      (chr)         (int)
# 1 2006-06-01 2006-06-03             3
# 2 2006-06-06 2006-06-07             2
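
To see how the grouping key inside group_by is built, here is a minimal base R sketch on the boolean column alone. The values are copied from the question with the NA already recoded to 0; the names b and group are only illustrative.

```r
# boolean column from the question, with the NA on 2006-06-04 replaced by 0
b <- c(1, 1, 1, 0, 0, 1, 1, 0, 1)

# diff(b) != 0 is TRUE wherever the value changes from one row to the next;
# the cumulative sum of those changes gives every run its own id
group <- cumsum(c(0, diff(b) != 0))
group
# [1] 0 0 0 1 1 2 2 3 4

# number of rows in each run
table(group)
```

Runs 0 and 2 are the only ones where b is 1 and the run has more than one row, which is what the filter step keeps.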

Data:

dat <- read.table(text="          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1", header=TRUE)

We can also use data.table to subset and cast the data as needed. First we create an id column with rleid(boolean). Next, we subset the data to the groups that have more than one row and where boolean is 1. Lastly, we create start, end, and run from the subsetted data:

library(data.table)
setDT(dat)[,id := rleid(boolean)][
  ,.SD[.N > 1 & boolean == 1],id][
  ,.(start=Date[1],end=Date[.N], run=.N),id]
#   id      start        end run
#1:  1 2006-06-01 2006-06-03   3
#2:  4 2006-06-06 2006-06-07   2
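
For readers without data.table, the same run-length idea can be sketched in base R with rle(). This is only an illustrative sketch, not part of the answer above: it rebuilds the question's data with Date stored as a Date class, recodes NA to 0 as an assumption about how NAs should be handled, and the names b, r, start_idx, end_idx, keep, and runs are made up for the example.

```r
# Same values as the question's data, with Date stored as Date class
dat <- data.frame(
  Date    = as.Date("2006-06-01") + 0:8,
  boolean = c(1, 1, 1, NA, 0, 1, 1, 0, 1)
)

# Assumption: an NA breaks a run, so recode it to 0 before run-length encoding
b <- ifelse(is.na(dat$boolean), 0, dat$boolean)
r <- rle(b)

end_idx   <- cumsum(r$lengths)         # row index of the last element of each run
start_idx <- end_idx - r$lengths + 1   # row index of the first element of each run
keep      <- r$values == 1 & r$lengths > 1

runs <- data.frame(
  start = dat$Date[start_idx[keep]],
  end   = dat$Date[end_idx[keep]],
  run   = r$lengths[keep]
)
runs
#        start        end run
# 1 2006-06-01 2006-06-03   3
# 2 2006-06-06 2006-06-07   2
```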

Another answer using base R, adapting the cumsum and diff approach from another answer.

#x is the input data frame (named dat in the Data block above)
#Remove ineligible dates (defined by 0 or NA)
x1 <- x[!(x$boolean %in% c(NA, 0)), ]

x1$Date <- as.Date(x1$Date)  #Convert date from factor to Date class

#Starting at 0, if the difference between eligible dates is >1 day, 
#   add 1 (TRUE) to the previous value, else add 0 (FALSE) to previous value
#This consecutively numbers each series
x1$SeriesNo <-  cumsum(c(0, diff(x1$Date) > 1))

#          Date boolean SeriesNo
#407 2006-06-01       1        0
#408 2006-06-02       1        0
#409 2006-06-03       1        0
#412 2006-06-06       1        1
#413 2006-06-07       1        1
#415 2006-06-09       1        2

# Aggregate: Perform the function FUN on variable Date by each SeriesNo group
x2 <-  as.data.frame(as.list(
         aggregate(Date ~ SeriesNo, data= x1, FUN=function(zz) 
         c(Start = min(zz), End= max(zz), Run = length(zz) ))
       )) #see note after this code block

#Output is in days since origin.  Reconvert them into Date class
x2$Date.Start <- as.Date(x2$Date.Start, origin = "1970-01-01")
x2$Date.End   <- as.Date(x2$Date.End,   origin = "1970-01-01")

#  SeriesNo Date.Start   Date.End Date.Run
#1        0 2006-06-01 2006-06-03        3
#2        1 2006-06-06 2006-06-07        2
#3        2 2006-06-09 2006-06-09        1

A note on the "buggy" output from aggregate: see "Using aggregate to apply several functions on several variables in one call".
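
The behavior that note refers to can be reproduced on a small numeric example. This is a sketch with made-up data (df, g, and v are hypothetical names): when FUN returns a named vector, aggregate stores the results in a single matrix column, and do.call(data.frame, ...) is one common way to flatten it, similar in effect to the as.data.frame(as.list(...)) wrapper used in the code above.

```r
# Toy data: two groups, one numeric variable
df <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10))

res <- aggregate(v ~ g, data = df,
                 FUN = function(z) c(Min = min(z), Max = max(z), N = length(z)))

ncol(res)          # 2: g plus a single matrix column named v
is.matrix(res$v)   # TRUE

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, res)
names(flat)
# [1] "g"     "v.Min" "v.Max" "v.N"
```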
