How to calculate the mean of a variable between two dates
I would like to calculate the mean of a variable between two dates. Below is a reproducible data frame.
year <- c(1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,
1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,1996,
1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,
1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997,1997)
month <- c("JAN","FEB","MAR","APR","MAY","JUN","JUL","AUG","SEP","OCT","NOV","DEC")
station <- c("A","A","A","A","A","A","A","A","A","A","A","A",
"B","B","B","B","B","B","B","B","B","B","B","B")
concentration <- as.numeric(round(runif(48,20,40),1))
df <- data.frame(year,month,station,concentration)
id <- c(1,2,3,4)
station1996 <- c("A","A","B","B")
station1997 <- c("B","A","A","B")
start <- c("06/01/1996","07/01/1996","07/01/1996","08/01/1996")
end <- c("04/01/1997","04/01/1997","04/01/1997","05/01/1997")
participant <- data.frame(id,station1996,station1997,start,end)
participant$start <- as.Date(participant$start, format = "%m/%d/%Y")
participant$end <- as.Date(participant$end, format = "%m/%d/%Y")
So I have two datasets, shown below.
df
   year month station concentration
1  1996   JAN       A          24.4
2  1996   FEB       A          37.0
3  1996   MAR       A          39.5
4  1996   APR       A          28.0
...
45 1997   SEP       B          37.7
46 1997   OCT       B          35.2
47 1997   NOV       B          26.8
48 1997   DEC       B          40.0
participant
  id station1996 station1997      start        end
1  1           A           B 1996-06-01 1997-04-01
2  2           A           A 1996-07-01 1997-04-01
3  3           B           A 1996-07-01 1997-04-01
4  4           B           B 1996-08-01 1997-05-01
For each id, I would like to calculate the average concentration between the start and end dates (by month and year). Note that the station might change between years.
For example, for id=1, I would like to calculate the average concentration between JUN 1996 and APR 1997. This should be based on the concentration from JUN 1996 to DEC 1996 at station A, and from JAN 1997 to APR 1997 at station B.
Can anyone help?
Thank you very much.
Here's a data.table solution. The basic idea is to enumerate all the dates in the start-end range as yearmon, for each id, and then use that as an index into the concentration table df. It's a bit convoluted, so hopefully someone will come along and show you a simpler way.
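To make the enumeration step concrete, here is a minimal base R sketch for a single participant (id 1 from the question). The names start1, end1, months1 and key1 are only illustrative, and format(..., "%Y-%m") stands in for as.yearmon so that the snippet needs no packages.

```r
# Start and end dates for id 1, as in the question's participant table
start1 <- as.Date("06/01/1996", format = "%m/%d/%Y")
end1   <- as.Date("04/01/1997", format = "%m/%d/%Y")

# One date per month in the range, both endpoints included
months1 <- seq(start1, end1, by = "month")

# A year-month key per date (a stand-in for zoo::as.yearmon)
key1 <- format(months1, "%Y-%m")

length(months1)   # 11 months: JUN 1996 through APR 1997
key1[c(1, 11)]    # first and last keys: "1996-06" "1997-04"
```

Because seq() includes both endpoints, the end month counts toward the mean; if it should be excluded, the last element would need to be dropped.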
library(data.table)
library(zoo) # for as.yearmon(...)
setDT(df) # convert to data.table
setDT(participant)
df[, yrmon:= as.yearmon(paste(year,month,sep="-"), format="%Y-%B")] # add year-month column
p.melt <- reshape(participant, varying=2:3, direction="long", sep="", timevar="year")
x <- participant[, .(date=seq(start,end,by="month")), by=id]
x[, c("year","yrmon"):=.(year(date),as.yearmon(date))] # add year and year-month
x[p.melt, station:=station, on=c("id","year")] # add station
x[df, conc:= concentration, on=c("yrmon","station"), nomatch=0] # add concentration
setorder(x,id) # not necessary, but makes it easier to interpret x
result <- x[, .(mean.conc=mean(conc)), by=id] # mean(conc) by id
result
#    id mean.conc
# 1:  1  28.61818
# 2:  2  28.56000
# 3:  3  28.44000
# 4:  4  29.60000
So, first we convert everything to data.tables. Then we add a yrmon column to df for indexing later. Then we create p.melt by reshaping participant to long format, so that the station is in one column and the indicator (1996 or 1997) is in a separate column. Then we create a temporary table, x, with the sequence of dates for each id, and add the year and yrmon for each of those dates. Then we merge that with p.melt on id and year to add a station column to x. Then we use yrmon and station to merge x with df to get the appropriate concentration. Finally, we aggregate conc by id in x using mean(...).
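For comparison, the same steps can be sketched in base R without data.table or zoo. This is only one possible alternative, not part of the solution above: the helper mean_conc and the result name result_base are hypothetical, and set.seed(1) is an added assumption so the random concentrations repeat between runs (the values will therefore differ from the output shown earlier).

```r
# Rebuild the question's data; set.seed() is an addition, not in the original
set.seed(1)
df <- data.frame(
  year    = rep(c(1996, 1997), each = 24),
  month   = rep(toupper(month.abb), times = 4),
  station = rep(rep(c("A", "B"), each = 12), times = 2),
  concentration = round(runif(48, 20, 40), 1)
)
participant <- data.frame(
  id = 1:4,
  station1996 = c("A", "A", "B", "B"),
  station1997 = c("B", "A", "A", "B"),
  start = as.Date(c("1996-06-01", "1996-07-01", "1996-07-01", "1996-08-01")),
  end   = as.Date(c("1997-04-01", "1997-04-01", "1997-04-01", "1997-05-01"))
)

# Hypothetical helper: mean concentration for row i of participant
mean_conc <- function(i) {
  # enumerate every month in the start-end range (endpoints included)
  d  <- seq(participant$start[i], participant$end[i], by = "month")
  yr <- as.integer(format(d, "%Y"))
  mo <- toupper(month.abb)[as.integer(format(d, "%m"))]
  # the station in effect for each month depends on that month's year
  st <- ifelse(yr == 1996,
               as.character(participant$station1996[i]),
               as.character(participant$station1997[i]))
  # look up each (year, month, station) combination in df
  idx <- match(paste(yr, mo, st), paste(df$year, df$month, df$station))
  mean(df$concentration[idx])
}

result_base <- data.frame(
  id = participant$id,
  mean.conc = sapply(seq_len(nrow(participant)), mean_conc)
)
result_base
```

The sketch hard-codes the two years 1996 and 1997 in the ifelse() call, so it would need a long-format station table, like p.melt above, to handle more years.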