
Matching nearest date between two data.frames

I have a data.frame containing the number of passengers who travelled by train with different operators on particular dates.

df<-data.frame(date_of_sampling =c("2021-01-01","2021-02-04","2021-01-03","2021-02-03"),operator=c("A","A","B","B"),num_passengers=c(204,155,100,400))

I then have another data.frame that shows the prevalence of COVID, reported roughly weekly.

ONS <- data.frame(sample_date_midpoint=c("2020-05-03","2020-06-10","2020-06-20","2020-08-03","2021-01-01","2021-01-06","2021-02-05","2021-02-08"),prevalence=runif(8))

I want to match each date in df to the ONS prevalence record with the closest date.

So far I have:

Base R

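# convert to Date so that the differences are numeric days
d1 <- as.Date(df$date_of_sampling)
d2 <- as.Date(ONS$sample_date_midpoint)

# absolute time differences in days (rows: df, columns: ONS)
temp <- abs(outer(as.numeric(d1), as.numeric(d2), "-"))

# discard ONS dates more than 5 days before or after the df date
temp[temp > 5] <- NA

# index of the nearest ONS row for each df row (NA if none is within 5 days)
ind <- apply(temp, 1, function(i) if (all(is.na(i))) NA_integer_ else which.min(i))

# output: one row per df row, with the matched ONS columns alongside
df2 <- cbind(df, ONS[ind, ])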

Question: how do I get a single, unique match when dates are tied?

data.table approach

library(data.table)

setDT(df)             ## convert to data.table by reference
setDT(ONS)            ## same

df[, date := date_of_sampling]       ## copy the sampling date into a common 'date' column
setkey(df, date)                     ## set the column to perform the join on
ONS[, date := sample_date_midpoint]  ## same for 'ONS'
setkey(ONS, date)                    ## same as above

ONS[df, roll = 5]

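This works, but what happens if several sampling dates are close to one another?

A dplyr approach?

You can convert the character dates to Date and use roll = 'nearest':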

library(data.table)

setDT(df)             ## convert to data.table by reference
setDT(ONS)            ## same

df[, date := as.Date(date_of_sampling)]       ## add a Date-typed join column to 'df'
setkey(df, date)                              ## set the column to perform the join on
ONS[, date := as.Date(sample_date_midpoint)]  ## add a Date-typed join column to 'ONS'
setkey(ONS, date)                             ## same as above

## nearest-date join, then keep only matches less than 5 days apart
ONS[df, roll = 'nearest'][
    abs(difftime(sample_date_midpoint, date_of_sampling, units = 'days')) < 5]

# Key: <date>
#    sample_date_midpoint prevalence       date date_of_sampling operator num_passengers
#                  <char>      <num>     <Date>           <char>   <char>          <num>
# 1:           2021-01-01  0.1964160 2021-01-01       2021-01-01        A            204
# 2:           2021-01-01  0.1964160 2021-01-03       2021-01-03        B            100
# 3:           2021-02-05  0.3906553 2021-02-03       2021-02-03        B            400
# 4:           2021-02-05  0.3906553 2021-02-04       2021-02-04        A            155
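
On the question of ties, a small base R sketch may help show what "nearest" can mean when a sampling date is exactly as far from one midpoint as from another. The toy dates below are made up for illustration and are not part of the original data. In base R, which.min() returns the first position holding the minimum, so with midpoints sorted in ascending order the earlier one is picked; which() can list every tied candidate instead.

```r
# Hypothetical toy data: 2021-01-03 is exactly 2 days from both midpoints
sampling  <- as.Date("2021-01-03")
midpoints <- as.Date(c("2021-01-01", "2021-01-05"))

# absolute gaps in days between the sampling date and each midpoint
gap <- abs(as.numeric(sampling - midpoints))

# which.min() returns the first position holding the minimum
first_match <- which.min(gap)

# every position tied for the minimum, if all candidates are wanted
all_matches <- which(gap == min(gap))

midpoints[first_match]
midpoints[all_matches]
```

Whether the earlier or the later midpoint is preferable on a tie depends on the analysis; the sketch only makes the choice explicit.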

A possible dplyr approach:

library(dplyr)

# Date formatting
ONS <- ONS |> mutate(sample_date_midpoint = as.Date(sample_date_midpoint))
df <- df |> mutate(date_of_sampling = as.Date(date_of_sampling))

# Identify closest + Join
df |>
  group_by(date_of_sampling) |>
  mutate(
    nearest_sample_date_midpoint =
      ONS$sample_date_midpoint[which.min(abs(ONS$sample_date_midpoint - first(date_of_sampling)))]
  ) |>
  ungroup() |>
  left_join(ONS, by = c("nearest_sample_date_midpoint" = "sample_date_midpoint")) # |>
  # optional edge-case filter: drop matches 5 or more days apart in either direction
  # filter(abs(as.numeric(nearest_sample_date_midpoint - date_of_sampling, units = "days")) < 5)

Output:

# A tibble: 5 × 5
  date_of_sampling operator num_passengers nearest_sample_date_midpoint prevalence
  <date>           <chr>             <dbl> <date>                            <dbl>
1 2021-01-01       A                   204 2021-01-01                       0.516 
2 2021-02-04       A                   155 2021-02-05                       0.0171
3 2021-01-03       B                   100 2021-01-01                       0.516 
4 2021-02-03       B                   400 2021-02-05                       0.0171
5 2019-02-03       C                  1000 2020-05-03                       0.208 

Data with an edge case:

df <- data.frame(date_of_sampling = c("2021-01-01","2021-02-04","2021-01-03","2021-02-03", "2020-02-03"),
                 operator = c("A","A","B","B", "C"),
                 num_passengers = c(204,155,100,400,1000)
                 )

Updated with an edge-case filter.
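
As a rough end-to-end sketch of the edge-case filter, the pipeline can be run on the edge-case data with the filter switched on. The set.seed() call and the matched name are assumptions added here for reproducibility and illustration, and the cut-off is written as "within 5 days in either direction" (<= 5) to mirror the question's wording; adjust the comparison if a strict bound is intended.

```r
library(dplyr)

# Edge-case data from the answer, with dates converted up front
df <- data.frame(
  date_of_sampling = as.Date(c("2021-01-01", "2021-02-04", "2021-01-03",
                               "2021-02-03", "2020-02-03")),
  operator = c("A", "A", "B", "B", "C"),
  num_passengers = c(204, 155, 100, 400, 1000)
)

set.seed(1)  # assumption: fixed seed only so the random prevalences are reproducible
ONS <- data.frame(
  sample_date_midpoint = as.Date(c("2020-05-03", "2020-06-10", "2020-06-20", "2020-08-03",
                                   "2021-01-01", "2021-01-06", "2021-02-05", "2021-02-08")),
  prevalence = runif(8)
)

matched <- df |>
  group_by(date_of_sampling) |>
  mutate(
    # nearest ONS midpoint for this sampling date (first one wins on a tie)
    nearest_sample_date_midpoint =
      ONS$sample_date_midpoint[which.min(abs(ONS$sample_date_midpoint - first(date_of_sampling)))]
  ) |>
  ungroup() |>
  left_join(ONS, by = c("nearest_sample_date_midpoint" = "sample_date_midpoint")) |>
  # keep only matches within 5 days in either direction
  filter(abs(as.numeric(nearest_sample_date_midpoint - date_of_sampling, units = "days")) <= 5)

matched
```

With this data the row for operator C is dropped, because its nearest midpoint (2020-05-03) is about three months after the sampling date.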
