Identify start date, end date, and length of runs of consecutive numbers, and transpose into a new data frame
I have a set of data that looks like this:
          Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1
From this, I am trying to create a new data frame that shows the dates on which my runs of 1s occur and how long those runs are, with the column headers: 1) start date, 2) end date, and 3) length of run.
Ultimately, I want to create a data frame that looks like this from the data I have above:
  Start Date   End Date Length of Run
1 2006-06-01 2006-06-03             3
2 2006-06-06 2006-06-07             2
I also have a few NAs in my data that need to be ignored throughout.
You could do this with dplyr, using mutate to convert missing boolean values to 0, group_by to compute groups with constant values of the variable boolean, filter to limit to groups where boolean was set to 1 and where the group had more than one member, and then summarize to grab the relevant summary information. (I take a few extra steps to remove the grouping variable at the end.)
library(dplyr)
dat %>%
  # Treat missing values as 0 so that they break a run of 1s
  mutate(boolean = ifelse(is.na(boolean), 0, boolean)) %>%
  # A new group starts wherever boolean changes value
  group_by(group = cumsum(c(0, diff(boolean) != 0))) %>%
  # Keep only runs of 1s that contain more than one row
  filter(boolean == 1 & n() > 1) %>%
  summarize("Start Date" = min(as.character(Date)),
            "End Date" = max(as.character(Date)),
            "Length of Run" = n()) %>%
  ungroup() %>%
  select(-matches("group"))
#   Start Date   End Date Length of Run
#        (chr)      (chr)         (int)
# 1 2006-06-01 2006-06-03             3
# 2 2006-06-06 2006-06-07             2
Data:
dat <- read.table(text="      Date boolean
407 2006-06-01       1
408 2006-06-02       1
409 2006-06-03       1
410 2006-06-04      NA
411 2006-06-05       0
412 2006-06-06       1
413 2006-06-07       1
414 2006-06-08       0
415 2006-06-09       1", header=TRUE)
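To see how the group_by step forms its groups, here is a minimal base R sketch of the cumsum/diff idiom on its own. The vector b is an illustrative stand-in for the boolean column after the NA has been replaced by 0; the names b, changed and group are not part of the answer above.

```r
# Example vector: the question's boolean column with the NA already set to 0
b <- c(1, 1, 1, 0, 0, 1, 1, 0, 1)

# diff(b) != 0 is TRUE wherever the value changes from one row to the next
changed <- diff(b) != 0

# Prepend 0 for the first row, then count the changes seen so far
group <- cumsum(c(0, changed))

print(data.frame(b, group))
```

Rows 1 to 3 share one group id and rows 6 and 7 share another; these are the two runs of 1s with more than one member that the filter step keeps.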
We can also use data.table to subset and cast the data as needed. First we create an id column with rleid(boolean). Next, subset the data according to the necessary conditions. Lastly, we create start, end, and run with the subsetted data:
library(data.table)
setDT(dat)[, id := rleid(boolean)][
  , .SD[.N > 1 & boolean == 1], id][
  , .(start = Date[1], end = Date[.N], run = .N), id]
#    id      start        end run
# 1:  1 2006-06-01 2006-06-03   3
# 2:  4 2006-06-06 2006-06-07   2
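For readers without data.table, the same run-length idea can be sketched in base R with rle(). This is a different technique from rleid(), shown only as an illustration: the data frame is rebuilt inline, NAs are set to 0 first (an assumption matching the dplyr answer), and the names r, ends, starts, keep and runs are made up for this sketch.

```r
# Rebuild the question's data inline so the sketch is self-contained
dat <- data.frame(
  Date = as.Date("2006-06-01") + 0:8,
  boolean = c(1, 1, 1, NA, 0, 1, 1, 0, 1)
)

# Treat NA as 0 so that it ends a run of 1s
b <- dat$boolean
b[is.na(b)] <- 0

# rle() returns the value and length of each run of identical values
r <- rle(b)

# Row index of the last and first element of every run
ends <- cumsum(r$lengths)
starts <- ends - r$lengths + 1

# Keep runs of 1s that are longer than one row
keep <- r$values == 1 & r$lengths > 1

runs <- data.frame(
  start = dat$Date[starts[keep]],
  end = dat$Date[ends[keep]],
  run = r$lengths[keep]
)
print(runs)
```

Dropping the `r$lengths > 1` condition would also report the single-day run on 2006-06-09.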
Another answer using base R, reformatting this answer's use of cumsum and diff.
# x is assumed to be the data frame from the question
# Remove ineligible dates (defined by 0 or NA)
x1 <- x[!(x$boolean %in% c(NA, 0)), ]
x1$Date <- as.Date(x1$Date) # Convert date from factor to Date class
# Starting at 0, if the difference between eligible dates is > 1 day,
# add 1 (TRUE) to the previous value, else add 0 (FALSE) to the previous value.
# This consecutively numbers each series
x1$SeriesNo <- cumsum(c(0, diff(x1$Date) > 1))
#           Date boolean SeriesNo
# 407 2006-06-01       1        0
# 408 2006-06-02       1        0
# 409 2006-06-03       1        0
# 412 2006-06-06       1        1
# 413 2006-06-07       1        1
# 415 2006-06-09       1        2
# Aggregate: perform the function FUN on variable Date by each SeriesNo group.
# as.numeric() keeps each result a plain number (days since 1970-01-01)
x2 <- as.data.frame(as.list(
  aggregate(Date ~ SeriesNo, data = x1, FUN = function(zz)
    c(Start = as.numeric(min(zz)), End = as.numeric(max(zz)), Run = length(zz)))
)) # see note after this code block
# Output is in days since origin. Reconvert them into Date class
x2$Date.Start <- as.Date(x2$Date.Start, origin = "1970-01-01")
x2$Date.End <- as.Date(x2$Date.End, origin = "1970-01-01")
#   SeriesNo Date.Start   Date.End Date.Run
# 1        0 2006-06-01 2006-06-03        3
# 2        1 2006-06-06 2006-06-07        2
# 3        2 2006-06-09 2006-06-09        1
A note on the "buggy" output from aggregate: see "Using aggregate to apply several functions on several variables in one call".
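The behaviour that note refers to can be reproduced on a toy example. When FUN returns more than one value, aggregate stores the results as a single matrix column, which is why the code above wraps the call in as.data.frame(as.list(...)). The data frame df and the names agg and flat are hypothetical, chosen only for this sketch.

```r
# Toy data: one grouping column and one numeric column
df <- data.frame(g = c("a", "a", "b"), v = c(1, 3, 10))

# FUN returns two named values per group
agg <- aggregate(v ~ g, data = df,
                 FUN = function(z) c(Min = min(z), Max = max(z)))

# agg has only two columns: g, and a matrix column v holding Min and Max
print(ncol(agg))
print(is.matrix(agg$v))

# Flatten the matrix column into ordinary columns
flat <- as.data.frame(as.list(agg))
print(names(flat))
```

After flattening, each value returned by FUN becomes its own column, named after the variable and the element name.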