R: Conditionally extract data from one dataframe to another

I have two dataframes, and I want to conditionally extract data from one column of one dataframe and put it into a new column of the other dataframe.

Dataframe 1 looks like this:

df1 <- data.frame(date.start = c("2019-06-10 11:52:00",
  "2019-06-11 11:52:00", "2019-06-12 11:51:00"), date.end =
  c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"))

Dataframe 2 looks like this:

df2 <- data.frame(date.start = c("2019-06-11 11:50:00",
  "2019-06-10 11:51:00", "2019-06-12 11:50:00"), date.end =
  c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00"),
  day = c(1, 15, 64))

If the date.start and date.end of a row of df1 both fall within the date.start and date.end of any row of df2, I want to extract the variable day from df2 and put it into the matching row of df1.
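To make the condition concrete, here is a minimal sketch of the check for a single row of df1, using the example data from this question. The helper to_time and the fixed "UTC" time zone are assumptions made for illustration; they are not part of the original post.

```r
# Hypothetical helper: parse the timestamps with a fixed time zone (assumption)
to_time <- function(x) as.POSIXct(x, tz = "UTC")

df1 <- data.frame(
  date.start = to_time(c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00")),
  date.end   = to_time(c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"))
)
df2 <- data.frame(
  date.start = to_time(c("2019-06-11 11:50:00", "2019-06-10 11:51:00", "2019-06-12 11:50:00")),
  date.end   = to_time(c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00")),
  day        = c(1, 15, 64)
)

# For row i of df1, flag every row of df2 whose interval contains it
i <- 1
inside <- df1$date.start[i] >= df2$date.start & df1$date.end[i] <= df2$date.end
inside           # FALSE  TRUE FALSE
df2$day[inside]  # 15
```

Only the second row of df2 contains the first interval of df1, so that row should receive day 15.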

The expected outcome looks like this:

expected.out <- data.frame(date.start = c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00"),
                           date.end = c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"),
                           day = c(15, 1, 64))

I currently have the following loop, which works, but it is very slow when I run it on my large dataframe (rows = 1135133), and I am wondering whether there is a faster way of doing this.

for(i in 1:nrow(df1)){
  find.match <- which(df1$date.start[i] >= df2$date.start &
                        df1$date.end[i] <= df2$date.end)
  if(length(find.match) !=0){
    df1$day[i] <- df2$day[find.match]
  }
  
}

Use library(fuzzyjoin):

library(tidyverse)
library(lubridate)
library(fuzzyjoin)

df1 <- data.frame(
  date.start = c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00"),
  date.end = c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"), stringsAsFactors = F)

df2 <- data.frame(date.start = c("2019-06-11 11:50:00", "2019-06-10 11:51:00", "2019-06-12 11:50:00"),
                  date.end = c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00"),
                  day = c(1, 15, 64), stringsAsFactors = F)

df1 <- df1 %>% 
  mutate(across(where(is.character), ymd_hms)) %>% 
  as_tibble()

df2 <- df2 %>% 
  mutate(across(where(is.character), ymd_hms)) %>% 
  as_tibble()


fuzzy_left_join(df1, df2, by = c("date.start", "date.end"), match_fun = list(`>=`, `<=`))
# A tibble: 3 x 5
  date.start.x        date.end.x          date.start.y        date.end.y            day
  <dttm>              <dttm>              <dttm>              <dttm>              <dbl>
1 2019-06-10 11:52:00 2019-06-10 11:53:00 2019-06-10 11:51:00 2019-06-11 08:59:00    15
2 2019-06-11 11:52:00 2019-06-11 11:53:00 2019-06-11 11:50:00 2019-06-11 11:54:00     1
3 2019-06-12 11:51:00 2019-06-12 11:53:00 2019-06-12 11:50:00 2019-06-12 11:57:00    64

Created on 2020-09-23 by the reprex package (v0.3.0)
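The joined result carries the columns of both inputs, whereas the expected output in the question has only three. A possible follow-up step, sketched here under the assumption that the dplyr and fuzzyjoin packages are installed, is to keep and rename the df1 columns. The helper to_time and the name result are illustrative only.

```r
library(dplyr)
library(fuzzyjoin)

# Hypothetical helper: parse the timestamps with a fixed time zone (assumption)
to_time <- function(x) as.POSIXct(x, tz = "UTC")

df1 <- tibble(
  date.start = to_time(c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00")),
  date.end   = to_time(c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"))
)
df2 <- tibble(
  date.start = to_time(c("2019-06-11 11:50:00", "2019-06-10 11:51:00", "2019-06-12 11:50:00")),
  date.end   = to_time(c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00")),
  day        = c(1, 15, 64)
)

# Same join as in the answer: df1 start >= df2 start and df1 end <= df2 end
joined <- fuzzy_left_join(df1, df2,
                          by = c("date.start", "date.end"),
                          match_fun = list(`>=`, `<=`))

# Keep only the df1 interval and the matched day, restoring the original names
result <- joined %>%
  select(date.start = date.start.x, date.end = date.end.x, day)
result
```

If a row of df1 falls inside several intervals of df2, a left join returns one row per match, so the result can have more rows than df1.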

I am not sure whether this method is fast.
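If speed is the main concern, one alternative that may be worth trying is a non-equi update join in data.table. This is a different technique from the fuzzyjoin call above and has not been benchmarked here; the sketch assumes the data.table package is installed and that the columns are POSIXct. The names dt1, dt2 and to_time are illustrative.

```r
library(data.table)

# Hypothetical helper: parse the timestamps with a fixed time zone (assumption)
to_time <- function(x) as.POSIXct(x, tz = "UTC")

dt1 <- data.table(
  date.start = to_time(c("2019-06-10 11:52:00", "2019-06-11 11:52:00", "2019-06-12 11:51:00")),
  date.end   = to_time(c("2019-06-10 11:53:00", "2019-06-11 11:53:00", "2019-06-12 11:53:00"))
)
dt2 <- data.table(
  date.start = to_time(c("2019-06-11 11:50:00", "2019-06-10 11:51:00", "2019-06-12 11:50:00")),
  date.end   = to_time(c("2019-06-11 11:54:00", "2019-06-11 08:59:00", "2019-06-12 11:57:00")),
  day        = c(1, 15, 64)
)

# Non-equi update join: in each condition the left side is a dt1 column and
# the right side is a dt2 column. Rows of dt1 lying inside a dt2 interval
# get that interval's day (i.day); rows without a match keep NA.
dt1[dt2, on = .(date.start >= date.start, date.end <= date.end), day := i.day]
dt1
```

The column day is added to dt1 by reference, so no loop over rows is needed. When intervals in dt2 overlap, a row of dt1 can match more than once and only one of the matching values is kept, so it is worth checking for overlaps first.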

You can use match inside sapply to get the first row of df2 where the dates are inside the given time range.

df1[] <- lapply(df1, as.POSIXct) #Convert character to POSIXct
df2[1:2] <- lapply(df2[1:2], as.POSIXct)

df1$day <- df2$day[sapply(asplit(df1, 1), function(x) {match(TRUE,
 x[1] >= df2[,1] & x[2] <= df2[,2])})]
df1
#           date.start            date.end day
#1 2019-06-10 11:52:00 2019-06-10 11:53:00  15
#2 2019-06-11 11:52:00 2019-06-11 11:53:00   1
#3 2019-06-12 11:51:00 2019-06-12 11:53:00  64

Using between from data.table with outer. The which.max call picks the first TRUE value in each row of the matching matrix.

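library(data.table)
# TRUE when both dates of df1 row x lie inside the interval of df2 row y
FUN <- Vectorize(function(x, y) all(between(unlist(df1[x, ]), df2[y, 1], df2[y, 2])))
# Test every pair of rows, then take the first matching df2 row for each df1 row
res <- transform(df1, day=df2[apply(outer(seq_len(nrow(df1)), seq_len(nrow(df2)), FUN), 1, which.max), 3])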
res
#            date.start            date.end day
# 1 2019-06-10 11:52:00 2019-06-10 11:53:00  15
# 2 2019-06-11 11:52:00 2019-06-11 11:53:00   1
# 3 2019-06-12 11:51:00 2019-06-12 11:53:00  64
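One caveat with which.max on a logical row: when the row contains no TRUE at all, it still returns 1, so a row of df1 without any match would silently receive the day of the first row of df2. The sketch below uses a small made-up matrix m (not data from this thread) to show the behaviour and one possible guard.

```r
# Made-up matching matrix: row 1 has a match in column 2, row 2 has no match
m <- rbind(c(FALSE, TRUE, FALSE),
           c(FALSE, FALSE, FALSE))

# which.max returns the position of the first maximum, which is 1 for an all-FALSE row
plain <- apply(m, 1, which.max)
plain    # 2 1

# Guarded version: return NA when a row has no TRUE
guarded <- apply(m, 1, function(r) if (any(r)) which.max(r) else NA_integer_)
guarded  # 2 NA
```

Indexing df2 with the guarded positions yields NA for unmatched rows instead of a wrong value.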

You may want to convert the columns to POSIXct format beforehand to apply this solution.

df1[1:2] <- lapply(df1[1:2], as.POSIXct)
df2[1:2] <- lapply(df2[1:2], as.POSIXct)

Data:

df1 <- structure(list(date.start = structure(c(1560160320, 1560246720, 
1560333060), class = c("POSIXct", "POSIXt"), tzone = ""), date.end = structure(c(1560160380, 
1560246780, 1560333180), class = c("POSIXct", "POSIXt"), tzone = "")), row.names = c(NA, 
-3L), class = "data.frame")

df2 <- structure(list(date.start = structure(c(1560246600, 1560160260, 
1560333000), class = c("POSIXct", "POSIXt"), tzone = ""), date.end = structure(c(1560246840, 
1560236340, 1560333420), class = c("POSIXct", "POSIXt"), tzone = ""), 
    day = c(1, 15, 64)), row.names = c(NA, -3L), class = "data.frame")
