R: Conditionally extract data from one dataframe to another
I have two dataframes, and I want to conditionally extract data from one column of one dataframe and put it into a new column of the other dataframe.
Dataframe 1 looks like this:
df1 <- data.frame(
  date.start = c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00"),
  date.end = c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"))
Dataframe 2 looks like this:
df2 <- data.frame(
  date.start = c("2019-06-11 11:50:00", "2019-06-10 11:51:00", "2019-06-12 11:50:00"),
  date.end = c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00"),
  day = c(1, 15, 64))
If the date.start and date.end of a row of df1 both fall within the interval from date.start to date.end of any row of df2, I want to extract the variable day from df2 and put it into the matching row of df1.
The expected outcome looks like this:
expected.out <- data.frame(date.start = c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00"),
date.end = c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"),
day = c(15, 1, 64))
I currently have the following loop, which works but is very slow on my large dataframe (1,135,133 rows). Is there a faster way of doing this?
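df1$day <- NA_real_  # create the column first so unmatched rows stay NA
for (i in seq_len(nrow(df1))) {
  # rows of df2 whose interval contains row i of df1
  find.match <- which(df1$date.start[i] >= df2$date.start &
                      df1$date.end[i] <= df2$date.end)
  if (length(find.match) != 0) {
    df1$day[i] <- df2$day[find.match[1]]  # keep the first match if there are several
  }
}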
Use library(fuzzyjoin):
library(tidyverse)
library(lubridate)
library(fuzzyjoin)
df1 <- data.frame(
  date.start = c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00"),
  date.end = c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"),
  stringsAsFactors = FALSE)
df2 <- data.frame(
  date.start = c("2019-06-11 11:50:00", "2019-06-10 11:51:00", "2019-06-12 11:50:00"),
  date.end = c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00"),
  day = c(1, 15, 64), stringsAsFactors = FALSE)
# Parse the character columns into date-times
df1 <- df1 %>%
  mutate(across(where(is.character), ymd_hms)) %>%
  as_tibble()
df2 <- df2 %>%
  mutate(across(where(is.character), ymd_hms)) %>%
  as_tibble()
# Keep df1 rows, matching where df1 start >= df2 start and df1 end <= df2 end
fuzzy_left_join(df1, df2, by = c("date.start", "date.end"), match_fun = list(`>=`, `<=`))
# A tibble: 3 x 5
date.start.x date.end.x date.start.y date.end.y day
<dttm> <dttm> <dttm> <dttm> <dbl>
1 2019-06-10 11:52:00 2019-06-10 11:53:00 2019-06-10 11:51:00 2019-06-11 08:59:00 15
2 2019-06-11 11:52:00 2019-06-11 11:53:00 2019-06-11 11:50:00 2019-06-11 11:54:00 1
3 2019-06-12 11:51:00 2019-06-12 11:53:00 2019-06-12 11:50:00 2019-06-12 11:57:00 64
Created on 2020-09-23 by the reprex package (v0.3.0)
I am not sure whether this method is fast.
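Since speed on a table with over a million rows is the concern, one option that may scale better than a row-wise loop is a non-equi update join in data.table. The following is a minimal sketch under stated assumptions, not a benchmarked solution: it assumes the data.table package is installed, dt1 and dt2 are illustrative copies of df1 and df2, and tz = "UTC" is an arbitrary choice made only so the timestamps parse reproducibly.

```r
library(data.table)

# Illustrative copies of df1 and df2 (names dt1/dt2 are assumptions)
dt1 <- data.table(
  date.start = as.POSIXct(c("2019-06-10 11:52:00", "2019-06-11 11:52:00",
                            "2019-06-12 11:51:00"), tz = "UTC"),
  date.end   = as.POSIXct(c("2019-06-10 11:53:00", "2019-06-11 11:53:00",
                            "2019-06-12 11:53:00"), tz = "UTC")
)
dt2 <- data.table(
  date.start = as.POSIXct(c("2019-06-11 11:50:00", "2019-06-10 11:51:00",
                            "2019-06-12 11:50:00"), tz = "UTC"),
  date.end   = as.POSIXct(c("2019-06-11 11:54:00", "2019-06-11 08:59:00",
                            "2019-06-12 11:57:00"), tz = "UTC"),
  day = c(1, 15, 64)
)

# Non-equi update join: in each condition the left side is a dt1 column and
# the right side is a dt2 column, so a dt1 row matches when
# dt1 start >= dt2 start and dt1 end <= dt2 end.
# `day` is added to dt1 by reference; unmatched rows are left as NA.
dt1[dt2, on = .(date.start >= date.start, date.end <= date.end), day := i.day]

dt1$day
```

If several intervals in dt2 contain the same dt1 row, only one day value is kept for that row, so overlapping intervals would need separate handling.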
You can use match inside sapply to get the first row of df2 whose time range contains the dates of each df1 row.
df1[] <- lapply(df1, as.POSIXct) #Convert character to POSIXct
df2[1:2] <- lapply(df2[1:2], as.POSIXct)
df1$day <- df2$day[sapply(asplit(df1, 1), function(x) {match(TRUE,
x[1] >= df2[,1] & x[2] <= df2[,2])})]
df1
# date.start date.end day
#1 2019-06-10 11:52:00 2019-06-10 11:53:00 15
#2 2019-06-11 11:52:00 2019-06-11 11:53:00 1
#3 2019-06-12 11:51:00 2019-06-12 11:53:00 64
Using between from data.table with outer. which.max then returns the position of the first TRUE value in each row of the matching matrix.
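library(data.table)
# TRUE when both dates of df1 row x lie inside the interval of df2 row y
FUN <- Vectorize(function(x, y) all(between(unlist(df1[x, ]), df2[y, 1], df2[y, 2])))
# Use row counts instead of hard-coded 1:3 so this also runs on larger inputs
res <- transform(df1, day=df2[apply(outer(seq_len(nrow(df1)), seq_len(nrow(df2)), FUN), 1, which.max), 3])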
res
# date.start date.end day
# 1 2019-06-10 11:52:00 2019-06-10 11:53:00 15
# 2 2019-06-11 11:52:00 2019-06-11 11:53:00 1
# 3 2019-06-12 11:51:00 2019-06-12 11:53:00 64
You may want to convert the date columns to POSIXct format beforehand to apply the solution.
df1[1:2] <- lapply(df1[1:2], as.POSIXct)
df2[1:2] <- lapply(df2[1:2], as.POSIXct)
Data:
df1 <- structure(list(date.start = structure(c(1560160320, 1560246720,
1560333060), class = c("POSIXct", "POSIXt"), tzone = ""), date.end = structure(c(1560160380,
1560246780, 1560333180), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA,
-3L), class = "data.frame")
df2 <- structure(list(date.start = structure(c(1560246600, 1560160260,
1560333000), class = c("POSIXct", "POSIXt"), tzone = ""), date.end = structure(c(1560246840,
1560236340, 1560333420), class = c("POSIXct", "POSIXt"), tzone = ""),
day = c(1, 15, 64)), row.names = c(NA, -3L), class = "data.frame")