trying to calculate number of cases between 2 dates, there is table that contains a number and time interval, i want to create output table with dates and sum of cases. The easy (and resolved) problem is:
df <- data.frame(person = c("A", "B", "C"), start = c("2014-01-01", "2014-01-03", "2014-01-04"), stop = c("2014-01-02", "2014-01-06", "2014-01-04") )
df
f1 = function() { #keeping dates
as.data.frame(table(unlist(apply(df[-1], 1,
function(x) as.character(seq(as.Date(x[1], "%Y-%m-%d"),
as.Date(x[2], "%Y-%m-%d"), "1 day"))))))}
f1()
and it would return
Var1 Freq
1 2014-01-01 1
2 2014-01-02 1
3 2014-01-03 1
4 2014-01-04 2
5 2014-01-05 1
6 2014-01-06 1
what i would need is to summarize the first column instead, with the input data like this
df <- data.frame(cases = c(5, 2, 2), start = c("2014-01-01", "2014-01-03", "2014-01-04"), stop = c("2014-01-02", "2014-01-06", "2014-01-04") )
it should return
Var1 cases
1 2014-01-01 5
2 2014-01-02 5
3 2014-01-03 2
4 2014-01-04 4
5 2014-01-05 2
6 2014-01-06 2
maybe it would not be even case for unlisting, what can i potentially use to calculate number of cases per day? and is there any way to show 0 values if the date is valid between start and end but does not have any occurrences in the data
EDIT
Aichao's answer is what i needed - the only missing bit is to get 0 sums for example in
df <- data.frame(cases = c(5, 2, 2),
start = c("2014-01-01", "2014-01-04", "2014-01-04"),
stop = c("2014-01-02", "2014-01-06", "2014-01-04") )
to get
Var1 x
1 2014-01-01 5
2 2014-01-02 5
3 2014-01-03 0
4 2014-01-04 4
5 2014-01-05 2
6 2014-01-06 2
Here is a solution that is in-line with what you are doing with f1
:
f2 <- function(df) {
df2 <- do.call(rbind, lapply(1:nrow(df), function(i) {
Var1 <- as.character(seq(as.Date(df$start[i],format="%Y-%m-%d"),
as.Date(df$stop[i],format="%Y-%m-%d"),"day"))
cases <- rep(df$cases[i],length(Var1))
data.frame(Var1,cases)
}))
aggregate(df2[,-1], by=list(Var1=df2[,1]), FUN=sum)
}
In f2
:
df2
from the sequence of dates from df$start
to df$stop
for each row in df
. Here, lapply
is used to loop over each row of df
and the cases
for each row is repeated to match the length
of the resulting sequence of dates. Then use rbind
to combine each of these data frames by rows. aggregate
(from the stats
package) this df2
by dates (ie, Var1
) and sum
up the cases
. With your data:
f2(df)
## Var1 x
##1 2014-01-01 5
##2 2014-01-02 5
##3 2014-01-03 2
##4 2014-01-04 4
##5 2014-01-05 2
##6 2014-01-06 2
One way to fill in the missing dates with 0
as cases
is to take the aggregated results from the above solution and creating a new sequence of dates spanning the range of dates. This will create the Var1
column for the new output. Then, it is a matter of copying the cases from the old result to the new output that matches on date:
f2 <- function(df) {
df2 <- do.call(rbind, lapply(1:nrow(df), function(i) {
## note that we do not convert to characters here because we want to use these later to form the sequence
Var1 <- seq(as.Date(df$start[i],format="%Y-%m-%d"),
as.Date(df$stop[i],format="%Y-%m-%d"),"day")
cases <- rep(df$cases[i],length(Var1))
data.frame(Var1,cases)
}))
df2 <- aggregate(df2[,-1], by=list(Var1=df2[,1]), FUN=sum)
## sort previous result by date
df2 <- df2[order(df2[,1]),]
## create new sequence spanning range
Var1 <- as.character(seq(df2[1,1],df2[nrow(df2),1],"day"))
## create cases of zeros matching Var1 in length
cases <- rep(0,length(Var1))
## copy over cases from previous result that matches date
cases[na.omit(match(as.character(df2[,1]),Var1))] <- df2[,2]
## output as data frame
data.frame(Var1,cases)
}
On your updated data:
f2(df)
## Var1 cases
##1 2014-01-01 5
##2 2014-01-02 5
##3 2014-01-03 0
##4 2014-01-04 4
##5 2014-01-05 2
##6 2014-01-06 2
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.