[英]R - Merging Two Data.Frames with Row-Level Conditional Variables
簡短版:我想使用dplyr或merge進行優化,因此比平常的合並操作要復雜一些。 我已經有很多解決方案,但是這些解決方案在大型數據集上的運行速度非常慢,我很好奇R中是否存在更快的方法(或者在SQL或python中)
我有兩個data.frames:
問題是:商店ID是特定位置的唯一標識符,但是商店位置可能會將所有權從一個期間更改為下一個期間(並且出於完整性考慮,沒有兩個所有者可以同時擁有同一家商店)。 因此,當我合並商店級別信息時,我需要某種條件來合並正確時期的商店級別信息。
可重現的示例:
# asynchronous log.
# t for period.
# Store for store loc ID
# var1 just some variable.
set.seed(1)
df <- data.frame(
t = c(1,1,1,2,2,2,3,3,4,4,4),
Store = c(1,2,3,1,2,3,1,3,1,2,3),
var1 = runif(11,0,1)
)
# Store table
# You can see, lots of store location opening and closing,
# StateDate is when this business came into existence
# Store is the store id from df
# CloseDate is when this store when out of business
# storeVar1 is just some important var to merge over
Stores <- data.frame(
StartDate = c(0,0,0,4,4),
Store = c(1,2,3,2,3),
CloseDate = c(9,2,3,9,9),
storeVar1 = c("a","b","c","d","e")
)
現在,我只想合並Store
df中的信息以進行記錄(如果該Store
在那個時期( t
)營業)。 CloseDate
和StartDate
指示該業務運營的最后和第一期。 ( 出於完整性考慮,但不是太重要,對於StartDate
0,該商店自樣本之前就存在。對於CloseDate
9,該商店在樣本結束時並沒有在該位置倒閉。 )
一種解決方案依賴於周期t
級split()
和dplyr::rbind_all()
,例如
# The following seems to do the trick.
complxMerge_v1 <- function(df, Stores, by = "Store"){
library("dplyr")
temp <- split(df, df$t)
for (Period in names(temp))(
temp[[Period]] <- dplyr::left_join(
temp[[Period]],
dplyr::filter(Stores,
StartDate <= as.numeric(Period) &
CloseDate >= as.numeric(Period)),
by = "Store"
)
)
df <- dplyr::rbind_all(temp); rm(temp)
df
}
complxMerge_v1(df, Stores, "Store")
從功能上來說,這似乎可行(無論如何都沒有遇到重大錯誤)。 但是,我們正在處理(越來越常見的)數十億行日志數據。
如果您想將其用於基准測試,我在sense.io上做了一個較大的可重現示例。 參見此處: https : //sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals
兩個問題:
在R中,您可以看一下data.table::foverlaps
函數
library(data.table)
# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]
setkey(df, Store, StartDate, CloseDate)
# Run `foverlaps` function
foverlaps(setDT(Stores), df)
# Store t var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
# 1: 1 1 0.26550866 1 1 0 9 a
# 2: 1 2 0.90820779 2 2 0 9 a
# 3: 1 3 0.94467527 3 3 0 9 a
# 4: 1 4 0.62911404 4 4 0 9 a
# 5: 2 1 0.37212390 1 1 0 2 b
# 6: 2 2 0.20168193 2 2 0 2 b
# 7: 3 1 0.57285336 1 1 0 3 c
# 8: 3 2 0.89838968 2 2 0 3 c
# 9: 3 3 0.66079779 3 3 0 3 c
# 10: 2 4 0.06178627 4 4 4 9 d
# 11: 3 4 0.20597457 4 4 4 9 e
您可以添加t
列來轉換Stores
data.frame,其中包含確定商店的t
所有值,然后使用Hadley的tydir
包中的unnest
函數將其轉換為“長”形式。
require("tidyr")
require("dplyr")
complxMerge_v2 <- function(df, Stores, by = NULL) {
Stores %>% mutate(., t = lapply(1:nrow(.),
function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))%>%
unnest(t) %>% left_join(df, ., by = by)
}
complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")
microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)
# Unit: milliseconds
# expr min lq mean median uq max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962 10
# complxMerge_v2(df, Stores) 532.744 539.743 567.7207 561.9635 588.0637 636.5775 10
以下是逐步說明結果,以使過程清晰明了。
Stores_with_t <-
Stores %>% mutate(., t = lapply(1:nrow(.),
function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2 0 2 2 b 0, 1, 2
# 3 0 3 3 c 0, 1, 2, 3
# 4 4 2 9 d 4, 5, 6, 7, 8, 9
# 5 4 3 9 e 4, 5, 6, 7, 8, 9
# After that `unnest(t)`
Stores_with_t_unnest <-
with_t %>% unnest(t)
# StartDate Store CloseDate storeVar1 t
# 1 0 1 9 a 0
# 2 0 1 9 a 1
# 3 0 1 9 a 2
# 4 0 1 9 a 3
# 5 0 1 9 a 4
# 6 0 1 9 a 5
# 7 0 1 9 a 6
# 8 0 1 9 a 7
# 9 0 1 9 a 8
# 10 0 1 9 a 9
# 11 0 2 2 b 0
# 12 0 2 2 b 1
# 13 0 2 2 b 2
# 14 0 3 3 c 0
# 15 0 3 3 c 1
# 16 0 3 3 c 2
# 17 0 3 3 c 3
# 18 4 2 9 d 4
# 19 4 2 9 d 5
# 20 4 2 9 d 6
# 21 4 2 9 d 7
# 22 4 2 9 d 8
# 23 4 2 9 d 9
# 24 4 3 9 e 4
# 25 4 3 9 e 5
# 26 4 3 9 e 6
# 27 4 3 9 e 7
# 28 4 3 9 e 8
# 29 4 3 9 e 9
# And then simple `left_join`
left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
# t Store var1 StartDate CloseDate storeVar1
# 1 1 1 0.26550866 0 9 a
# 2 1 2 0.37212390 0 2 b
# 3 1 3 0.57285336 0 3 c
# 4 2 1 0.90820779 0 9 a
# 5 2 2 0.20168193 0 2 b
# 6 2 3 0.89838968 0 3 c
# 7 3 1 0.94467527 0 9 a
# 8 3 3 0.66079779 0 3 c
# 9 4 1 0.62911404 0 9 a
# 10 4 2 0.06178627 4 9 d
# 11 4 3 0.20597457 4 9 e
聲明:本站的技術帖子網頁,遵循CC BY-SA 4.0協議,如果您需要轉載,請注明本站網址或者原文地址。任何問題請咨詢:yoyou2525@163.com.