[英]Date is within an interval and pull field
I have two df's: maindf
and list
. 我有两个df:
maindf
和list
。
ID <- c(1, 1, 1, 1, 5, 5)
SURVEY_DATE <- c("1997-08-01", "1998-08-20", "1998-11-20", "2000-12-13", "1998-05-02", "1998-12-25")
SURVEY_DATE <- as.Date(SURVEY_DATE)
maindf <- data.frame(ID, SURVEY_DATE)
maindf
ID <- c(1, 1, 1, 1, 5, 5)
ASSIGN_DATE <- c(1997, 1998, 1999, 2000, 1997, 1998)
TIME1 <- c("1997-07-23", "1998-11-17", "1999-12-15", "2000-12-11", "1998-04-07", "1998-12-06")
TIME1 <- as.Date(TIME1)
TIME2 <- c("1998-11-17", "1999-12-15", "2000-12-11", "2001-12-30", "1998-12-06", "1999-11-28")
TIME2 <- as.Date(TIME2)
list <- data.frame(ID, ASSIGN_DATE, TIME1, TIME2)
list
The maindf
has a SURVEY_DATE
field. maindf
有一个SURVEY_DATE
字段。 This field needs to check in the list
to see if it falls within TIME1
and TIME2
by ID
. 此字段需要检查
list
以查看其是否在ID
TIME1
和TIME2
之内。 If it does, I would like to pull the ASSIGN_DATE
into the maindf
. 如果是这样,我想将
ASSIGN_DATE
maindf
。
The final product should look like: 最终产品应如下所示:
ID SURVEY_DATE ASSIGN_DATE
1 1 1997-08-01 1997
2 1 1998-08-20 1997
3 1 1998-11-20 1998
4 1 2000-12-13 2000
5 5 1998-05-02 1997
6 5 1998-12-25 1998
I know this is very similar to this post and this post , but I'm having some trouble with pulling a field over by ID
. 我知道这与本帖子和本帖子非常相似,但是我在按
ID
提取字段时遇到了一些麻烦。
The OP has requested " to pull the ASSIGN_DATE
into the maindf
". OP已请求“ 将
ASSIGN_DATE
拉入maindf
”。
This can be achieved by an update join which modifies maindf
by reference : 这可以通过更新连接来实现,该连接 通过引用修改
maindf
:
library(data.table)
setDT(maindf)[setDT(list), on = .(ID, SURVEY_DATE >= TIME1, SURVEY_DATE <= TIME2),
ASSIGN_DATE := i.ASSIGN_DATE][]
ID SURVEY_DATE ASSIGN_DATE 1: 1 1997-08-01 1997 2: 1 1998-08-20 1997 3: 1 1998-11-20 1998 4: 1 2000-12-13 2000 5: 5 1998-05-02 1997 6: 5 1998-12-25 1998
I lack the ingenuity to come up with anything more creative that a for
loop right now, but at least this will get the job done: 我缺乏创造力,比现在的
for
循环更具创造力,但是至少这可以完成工作:
# recreate data (because I like lowercase)
maindf <- data.frame(
id = c(1, 1, 1, 1, 5, 5),
sdate = as.Date(c("1997-08-01", "1998-08-20", "1998-11-20", "2000-12-13", "1998-05-02", "1998-12-25")))
otherdf <- data.frame(
id = c(1, 1, 1, 1, 5, 5),
adate = c(1997, 1998, 1999, 2000, 1997, 1998),
time1 = as.Date(c("1997-07-23", "1998-11-17", "1999-12-15", "2000-12-11", "1998-04-07", "1998-12-06")),
time2 = as.Date(c("1998-11-17", "1999-12-15", "2000-12-11", "2001-12-30", "1998-12-06", "1999-11-28"))
)
# my sad loop
maindf$adate <- NA
for(i in 1:nrow(maindf)) {
c1 <- otherdf$id == maindf[i, "id"]
c2 <- otherdf$time1 < maindf[i, "sdate"]
c3 <- otherdf$time2 > maindf[i, "sdate"]
maindf[i, "adate"] <- otherdf[c1 & c2 & c3, "adate"]
}
Option 1: The data.table
way 选项1:
data.table
方式
Using data.table::foverlaps
使用
data.table::foverlaps
library(data.table)
setDT(maindf)[, `:=`(TIME1 = SURVEY_DATE, TIME2 = SURVEY_DATE)]
setDT(list)
# Interval-merge by TIME1 and TIME2
setkey(list, ID, TIME1, TIME2)
dt <- foverlaps(maindf, list)
# Clean up to reproduce expected output
dt[, .SD, .SDcols = c(names(maindf)[1:2], "ASSIGN_DATE")]
# ID SURVEY_DATE ASSIGN_DATE
#1: 1 1997-08-01 1997
#2: 1 1998-08-20 1997
#3: 1 1998-11-20 1998
#4: 1 2000-12-13 2000
#5: 5 1998-05-02 1997
#6: 5 1998-12-25 1998
Explanation: foverlaps
performs an overlap-join, based on the time intervals from two data.tables; 说明:
foverlaps
根据来自两个data.tables的时间间隔执行重叠连接。 foverlaps
requires a start and end time point in each data.table
, so we choose TIME1 = SURVEY_DATE
as the start and TIME2 = SURVEY_DATA
as the end point for maindf
. foverlaps
要求在每个开始和结束的时间点data.table
,所以我们选择TIME1 = SURVEY_DATE
为开端, TIME2 = SURVEY_DATA
作为终点maindf
。 foverlaps
needs to know the keys by which to merge (here ID
, TIME1
and TIME2
) for the second argument of foverlaps
which we set with setkey
. foverlaps
需要知道的钥匙由合并(这里ID
, TIME1
和TIME2
)为第二个参数foverlaps
我们设置setkey
。
Option 2: The tidyverse
/ fuzzyjoin
way 选项2:
tidyverse
/ fuzzyjoin
方法
Using fuzzyjoin::fuzzy_left_join
使用
fuzzyjoin::fuzzy_left_join
library(fuzzyjoin)
library(tidyverse)
maindf %>% mutate(SURVEY_DATE = as.Date(SURVEY_DATE)) %>%
fuzzy_left_join(
list %>% mutate_at(vars(starts_with("TIME")), as.Date),
by = c("ID" = "ID", "SURVEY_DATE" = "TIME1", "SURVEY_DATE" = "TIME2"),
match_fun = list(`==`, `>=`, `<=`)) %>%
rename(ID = ID.x) %>%
select(names(maindf), ASSIGN_DATE)
# ID SURVEY_DATE ASSIGN_DATE
#1 1 1997-08-01 1997
#2 1 1998-08-20 1997
#3 1 1998-11-20 1998
#4 1 2000-12-13 2000
#5 5 1998-05-02 1997
#6 5 1998-12-25 1998
data.table "non-equi join" for the win: data.table“非equi join”获胜:
#re-create data as data.tables and with lowercase
library(data.table)
maindt <- data.table(
id = c(1, 1, 1, 1, 5, 5),
sdate = as.Date(c("1997-08-01", "1998-08-20", "1998-11-20", "2000-12-13", "1998-05-02", "1998-12-25")))
otherdt <- data.table(
id = c(1, 1, 1, 1, 5, 5),
adate = c(1997, 1998, 1999, 2000, 1997, 1998),
time1 = as.Date(c("1997-07-23", "1998-11-17", "1999-12-15", "2000-12-11", "1998-04-07", "1998-12-06")),
time2 = as.Date(c("1998-11-17", "1999-12-15", "2000-12-11", "2001-12-30", "1998-12-06", "1999-11-28"))
)
#one-line merge
maindt[otherdt, on = .(id==id, cond1 = sdate > time1, cond3 = sdate < time2), .(id, sdate=x.sdate, adate), nomatch=0]
The non-equi join syntax is a nightmare in my opinion, but I've always struggled with the dt1[dt2] merge style, so what do I know... 在我看来,非等号联接语法是一场噩梦,但我一直在努力应对dt1 [dt2]合并样式,因此我知道...
A base R solution using a full outer join and a conditional subset... 使用完全外部联接和条件子集的基本R解决方案...
#full outer join
foj <- merge(maindf, list, all = TRUE, by = "ID")
#conditional subset
df2 <- subset(foj, SURVEY_DATE >= TIME1 & SURVEY_DATE <= TIME2)
# > df2[, c("ID", "SURVEY_DATE", "ASSIGN_DATE")]
# ID SURVEY_DATE ASSIGN_DATE
# 1 1 1997-08-01 1997
# 5 1 1998-08-20 1997
# 10 1 1998-11-20 1998
# 16 1 2000-12-13 2000
# 17 5 1998-05-02 1997
# 20 5 1998-12-25 1998
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.