Is there an elegant way to merge two data frames by timestamp in R?

Suppose I have two data frames, df1 and df2.

df1 <- data.frame(value = 1:5, timestamp = as.POSIXct( c( "2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03" , "2020-03-02 12:20:05", "2020-03-02 12:20:08")))

df2 <- data.frame(value = 6:10, timestamp = as.POSIXct( c( "2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03" , "2020-03-02 12:20:04", "2020-03-02 12:20:05")))

df1

value  timestamp
1      2020-03-02 12:20:00
2      2020-03-02 12:20:01
3      2020-03-02 12:20:03
4      2020-03-02 12:20:05
5      2020-03-02 12:20:08

df2

value  timestamp
6      2020-03-02 12:20:01
7      2020-03-02 12:20:02
8      2020-03-02 12:20:03
9      2020-03-02 12:20:04
10     2020-03-02 12:20:05

Now I want to keep df1 and left join df2 onto it by timestamp. Since the timestamps are not exactly the same, what I want is:

  1. If there is an exact match, just left join the value from df2.
  2. If there is no exact match, match the most recent earlier timestamp in df2 and left join that value.
  3. If there is no match at all (no earlier timestamp in df2), return NA.

Therefore, my expected output would look like this:

data.frame(df1, value.df2 = c(NA, 6, 8, 10, 10))
value  timestamp            value.df2
1      2020-03-02 12:20:00  NA
2      2020-03-02 12:20:01  6
3      2020-03-02 12:20:03  8
4      2020-03-02 12:20:05  10
5      2020-03-02 12:20:08  10

I would like to do this with the tidyverse or data.table.

Here are several alternatives. I find the SQL solution the most descriptive. The base solution is pretty short and has no dependencies. The data.table approach is likely fast and the code is compact, but, unlike the prior two solutions, it is not obvious from the code what it does, so you need to read the documentation carefully to determine whether it is doing what you want. The dplyr/fuzzyjoin solution may be of interest if you are using the tidyverse.

1) sqldf Perform a left join such that we join to each row of a all rows of b having a timestamp less than or equal to it, and then take only the b row having the maximum timestamp among those joined to each a row. Note that SQLite guarantees that when max is used on a particular column, any other column referenced from the same table will come from that same row.

For large data, add the argument dbname = tempfile() to the sqldf call and it will perform the join on disk rather than in memory, so that R memory limitations don't apply. It would also be possible to add an index to the data to speed it up.

library(sqldf)

sqldf("select max(b.timestamp), a.*, b.value as 'value.df2'
  from df1 a
  left join df2 b on b.timestamp <= a.timestamp
  group by a.timestamp
  order by a.timestamp"
)[-1]

giving:

  value           timestamp value.df2
1     1 2020-03-02 12:20:00        NA
2     2 2020-03-02 12:20:01         6
3     3 2020-03-02 12:20:03         8
4     4 2020-03-02 12:20:05        10
5     5 2020-03-02 12:20:08        10

Note that it can be used within a magrittr pipeline by placing the sqldf statement within braces and referring to the left-hand side as [.] within the SQL statement:

library(magrittr)
library(sqldf)

df1 %>%
  { sqldf("select max(b.timestamp), a.*, b.value as 'value.df2'
      from [.] a
      left join df2 b on b.timestamp <= a.timestamp
      group by a.timestamp
      order by a.timestamp")[-1]
  }
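
The out-of-memory variant mentioned earlier can be sketched as follows. This is the same query with dbname = tempfile() added, so that sqldf uses a temporary on-disk SQLite database. The tz = "UTC" argument is an addition made here only to keep the example reproducible across machines; it is not part of the original data definition.

```r
library(sqldf)

# Same data as in the question; tz = "UTC" is an assumption for reproducibility
df1 <- data.frame(value = 1:5, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03",
  "2020-03-02 12:20:05", "2020-03-02 12:20:08"), tz = "UTC"))
df2 <- data.frame(value = 6:10, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03",
  "2020-03-02 12:20:04", "2020-03-02 12:20:05"), tz = "UTC"))

# dbname = tempfile() makes sqldf build the tables in a temporary database
# file on disk instead of the default in-memory database
res <- sqldf("select max(b.timestamp), a.*, b.value as 'value.df2'
  from df1 a
  left join df2 b on b.timestamp <= a.timestamp
  group by a.timestamp
  order by a.timestamp",
  dbname = tempfile())[-1]  # [-1] drops the helper max(b.timestamp) column

res
```

For data of this size the on-disk database only adds overhead; it should pay off only when the joined tables do not fit comfortably in memory.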

2) base For each timestamp in df1, find the df2 timestamps that are less than or equal to it and take the value belonging to the last of them, or NA if there are none. This relies on df2 being sorted by timestamp.

Match <- function(tt) with(df2, tail(c(NA, value[timestamp <= tt]), 1))
transform(df1, value.df2 = sapply(timestamp, Match))
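
If the sapply loop turns out to be slow on long inputs, a vectorized variant of the same idea can be sketched with base R's findInterval. This is an alternative added here, not part of the solution above. It assumes df2 is sorted by timestamp (the code sorts it to be safe), and tz = "UTC" is used only to make the example reproducible.

```r
# Same data as in the question; tz = "UTC" is an assumption for reproducibility
df1 <- data.frame(value = 1:5, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03",
  "2020-03-02 12:20:05", "2020-03-02 12:20:08"), tz = "UTC"))
df2 <- data.frame(value = 6:10, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03",
  "2020-03-02 12:20:04", "2020-03-02 12:20:05"), tz = "UTC"))

# findInterval() requires the lookup vector to be sorted
df2s <- df2[order(df2$timestamp), ]

# For each df1 timestamp, the index of the last df2 timestamp <= it;
# 0 means no df2 timestamp is early enough
idx <- findInterval(as.numeric(df1$timestamp), as.numeric(df2s$timestamp))

# Turn the 0 indexes into NA so that subsetting yields NA for those rows
res <- transform(df1, value.df2 = df2s$value[replace(idx, idx == 0, NA)])

res
```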

3) data.table This package supports rolling joins:

library(data.table)

as.data.table(df2)[df1, on = .(timestamp), roll = TRUE]
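
In the result of the rolling join above, the column named value comes from df2 and the df1 column appears as i.value, which is easy to misread. A possible way to get the column names asked for in the question is to select them explicitly in j with the x. and i. prefixes, as in this sketch. The intermediate names dt1, dt2 and res are chosen here for illustration, and tz = "UTC" is only for reproducibility.

```r
library(data.table)

# Same data as in the question; tz = "UTC" is an assumption for reproducibility
df1 <- data.frame(value = 1:5, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03",
  "2020-03-02 12:20:05", "2020-03-02 12:20:08"), tz = "UTC"))
df2 <- data.frame(value = 6:10, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03",
  "2020-03-02 12:20:04", "2020-03-02 12:20:05"), tz = "UTC"))

dt1 <- as.data.table(df1)
dt2 <- as.data.table(df2)

# roll = TRUE carries the last earlier dt2 row forward to each dt1 timestamp.
# In j, the i. prefix refers to dt1 columns and the x. prefix to dt2 columns.
res <- dt2[dt1, on = .(timestamp), roll = TRUE,
           .(value = i.value, timestamp = i.timestamp, value.df2 = x.value)]

res
```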

4) dplyr/fuzzyjoin fuzzy_left_join joins to each row of df1 all rows of df2 whose timestamp is less than or equal to the df1 timestamp. Then, for each df1 row, we keep only the last joined row and fix up the names.

library(dplyr)
library(fuzzyjoin)

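df1 %>%
  # keep every df2 row whose timestamp is <= the df1 timestamp
  fuzzy_left_join(df2, by = "timestamp", match_fun = `>=`) %>%
  group_by(timestamp.x) %>%
  # df2 is sorted by timestamp, so the last row in each group is the latest match
  slice(n()) %>%
  ungroup() %>%
  select(timestamp = timestamp.x, value = value.x, value.df2 = value.y)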
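
With dplyr 1.1.0 or later, a similar result can probably be obtained without fuzzyjoin, using join_by() with closest(). This is a sketch of that alternative rather than the method above; tz = "UTC" is added only to make the example reproducible.

```r
library(dplyr)  # join_by() and closest() need dplyr >= 1.1.0

# Same data as in the question; tz = "UTC" is an assumption for reproducibility
df1 <- data.frame(value = 1:5, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03",
  "2020-03-02 12:20:05", "2020-03-02 12:20:08"), tz = "UTC"))
df2 <- data.frame(value = 6:10, timestamp = as.POSIXct(c(
  "2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03",
  "2020-03-02 12:20:04", "2020-03-02 12:20:05"), tz = "UTC"))

res <- df1 %>%
  # for each df1 row, keep only the df2 row with the closest timestamp
  # that is less than or equal to the df1 timestamp
  left_join(df2, by = join_by(closest(x$timestamp >= y$timestamp))) %>%
  # both timestamp columns are kept with .x/.y suffixes; pick and rename
  select(value = value.x, timestamp = timestamp.x, value.df2 = value.y)

res
```

Unlike the fuzzyjoin version, this does not first build every qualifying pair of rows, so it may scale better when df2 is long.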

  

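Use the tidyverse package in this simple way. Note that this is an exact-match join, so a df1 row whose timestamp does not appear in df2 gets NA (see the last row of the output):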

df1 <- data.frame(value = 1:5, timestamp = as.POSIXct( c( "2020-03-02 12:20:00", "2020-03-02 12:20:01", "2020-03-02 12:20:03" , "2020-03-02 12:20:05", "2020-03-02 12:20:08")))
df2 <- data.frame(value = 6:10, timestamp = as.POSIXct( c( "2020-03-02 12:20:01", "2020-03-02 12:20:02", "2020-03-02 12:20:03" , "2020-03-02 12:20:04", "2020-03-02 12:20:05")))

library(tidyverse)
left_join(df1, df2, by = 'timestamp')

 #value.x           timestamp value.y
 #1       1 2020-03-02 12:20:00      NA
 #2       2 2020-03-02 12:20:01       6
 #3       3 2020-03-02 12:20:03       8
 #4       4 2020-03-02 12:20:05      10
 #5       5 2020-03-02 12:20:08      NA
