
R - Merging Two Data.Frames with Row-Level Conditional Variables

Short version: I'm hoping to optimize this with dplyr or merge, and it's slightly more complicated than the usual merge operation. I already have a number of solutions, but they run very slowly on large datasets, and I'm curious whether a faster approach exists in R (or in SQL or Python).


I have two data.frames:

  1. An asynchronous log of events related to stores, and
  2. A table that provides more detail about the stores in that log.

The problem: a store ID is a unique identifier for a particular location, but a store location can change ownership from one period to the next (and, for completeness, no two owners can hold the same store at the same time). So when I merge in the store-level information, I need some kind of condition so that the store-level information from the correct period gets merged.


Reproducible example:

# asynchronous log. 
#  t for period. 
#  Store for store loc ID
#  var1 just some variable. 
set.seed(1)
df <- data.frame(
  t     = c(1,1,1,2,2,2,3,3,4,4,4),
  Store = c(1,2,3,1,2,3,1,3,1,2,3),
  var1 =  runif(11,0,1)
)

# Store table
# You can see, lots of store locations opening and closing, 
#  StartDate is when this business came into existence
#  Store is the store id from df
#  CloseDate is when this store went out of business
#  storeVar1 is just some important var to merge over
Stores <- data.frame(
  StartDate = c(0,0,0,4,4),
  Store     = c(1,2,3,2,3),
  CloseDate = c(9,2,3,9,9),
  storeVar1 = c("a","b","c","d","e")
)

Now, I just want to merge the information from the Stores data.frame onto a log record if that Store was in business during that period (t). CloseDate and StartDate indicate the last and first periods, respectively, in which that business operated. For completeness, though it is not too important: a StartDate of 0 means the store existed before the sample began, and a CloseDate of 9 means the store had not gone out of business at that location by the end of the sample.

One solution relies on split()-ing by period t and dplyr::rbind_all(), e.g.

# The following seems to do the trick. 
complxMerge_v1 <- function(df, Stores, by = "Store"){
  library("dplyr")
  temp <- split(df, df$t)
  for (Period in names(temp)) {
    temp[[Period]] <- dplyr::left_join(
      temp[[Period]],
      dplyr::filter(Stores, 
                    StartDate <= as.numeric(Period) & 
                    CloseDate >= as.numeric(Period)),
      by = "Store"
    )
  }
  df <- dplyr::rbind_all(temp); rm(temp)
  df
}
complxMerge_v1(df, Stores, "Store")

Functionally, this seems to work (no major errors encountered, anyway). However, we are dealing with (the increasingly common case of) billions of rows of log data.

If you would like to use it for benchmarking, I made a larger reproducible example on sense.io. See here: https://sense.io/economicurtis/r-faster-merging-of-two-data.frames-with-row-level-conditionals


Two questions:

  1. First, is there another way to approach this problem, using a similar method, that runs faster?
  2. Is there any chance of a quick and easy solution in SQL or Python (neither of which I know well, but could lean on if needed)? A rough sketch of the SQL idea follows this list.
  3. Also, can you help me articulate this problem in a more general, abstract way? Right now I only know how to talk about it in context-specific terms, but I would love to be able to discuss these kinds of problems in more appropriate, yet more general, programming or data-manipulation terms.
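
To illustrate the second question, here is a minimal sketch (not from the original post) of how the condition can be written as a SQL range join. It assumes the df and Stores data.frames from the reproducible example above and uses the sqldf package, which runs SQLite queries against data.frames in the R session.

# Sketch only: express the conditional merge as a SQL range join via sqldf.
# Assumes the `df` and `Stores` objects defined in the reproducible example.
library(sqldf)

sqldf("
  SELECT d.t, d.Store, d.var1, s.StartDate, s.CloseDate, s.storeVar1
  FROM   df d
  LEFT JOIN Stores s
    ON  d.Store = s.Store
    AND d.t BETWEEN s.StartDate AND s.CloseDate
")

In database terms this is a range (or interval) join: an equality condition on Store combined with a non-equi condition on the period, which is one general way to describe the problem.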

In R, you can take a look at the data.table::foverlaps function:

library(data.table)

# Set start and end values in `df` and key by them and by `Store`
setDT(df)[, c("StartDate", "CloseDate") := list(t, t)]      
setkey(df, Store, StartDate, CloseDate)

# Run `foverlaps` function
foverlaps(setDT(Stores), df)
#     Store t       var1 StartDate CloseDate i.StartDate i.CloseDate storeVar1
#  1:     1 1 0.26550866         1         1           0           9         a
#  2:     1 2 0.90820779         2         2           0           9         a
#  3:     1 3 0.94467527         3         3           0           9         a
#  4:     1 4 0.62911404         4         4           0           9         a
#  5:     2 1 0.37212390         1         1           0           2         b
#  6:     2 2 0.20168193         2         2           0           2         b
#  7:     3 1 0.57285336         1         1           0           3         c
#  8:     3 2 0.89838968         2         2           0           3         c
#  9:     3 3 0.66079779         3         3           0           3         c
# 10:     2 4 0.06178627         4         4           4           9         d
# 11:     3 4 0.20597457         4         4           4           9         e
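
If the extra interval columns that foverlaps needs are not wanted in the result, a possible cleanup (an assumption about the desired output, not part of the original answer) is to drop the dummy columns copied from df and rename the ones that came from Stores:

# Cleanup sketch, assuming the keyed `df` and `Stores` prepared above.
res <- foverlaps(setDT(Stores), df)
res[, c("StartDate", "CloseDate") := NULL]        # drop df's dummy interval columns
setnames(res, c("i.StartDate", "i.CloseDate"),
              c("StartDate", "CloseDate"))        # these came from `Stores`
res[order(t, Store), .(t, Store, var1, StartDate, CloseDate, storeVar1)]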

You can transform the Stores data.frame by adding a t column that contains all the values of t for which the store is open, and then use the unnest function from Hadley's tidyr package to convert it to "long" form.

require("tidyr")
require("dplyr")

complxMerge_v2 <- function(df, Stores, by = NULL) {
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))%>%
    unnest(t) %>% left_join(df, ., by = by)
}

complxMerge_v2(df, Stores)
# Joining by: c("t", "Store")
#    t Store       var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e

require("microbenchmark")
# I've downloaded your large data samples
df <- read.csv("./df.csv")
Stores <- read.csv("./Stores.csv")

microbenchmark(complxMerge_v1(df, Stores), complxMerge_v2(df, Stores), times = 10L)

# Unit: milliseconds
#                       expr      min       lq      mean    median        uq       max neval
# complxMerge_v1(df, Stores) 9501.217 9623.754 9712.8689 9681.3808 9816.8984 9886.5962    10
# complxMerge_v2(df, Stores)  532.744  539.743  567.7207  561.9635  588.0637  636.5775    10

Below are the step-by-step results to make the process clear.

Stores_with_t <- 
  Stores %>% mutate(., t = lapply(1:nrow(.), 
                                  function(ii) (.)[ii, "StartDate"]:(.)[ii, "CloseDate"]))
#   StartDate Store CloseDate storeVar1                            t
# 1         0     1         9         a 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
# 2         0     2         2         b                      0, 1, 2
# 3         0     3         3         c                   0, 1, 2, 3
# 4         4     2         9         d             4, 5, 6, 7, 8, 9
# 5         4     3         9         e             4, 5, 6, 7, 8, 9

# After that `unnest(t)`

Stores_with_t_unnest <- 
  Stores_with_t %>% unnest(t)
#    StartDate Store CloseDate storeVar1 t
# 1          0     1         9         a 0
# 2          0     1         9         a 1
# 3          0     1         9         a 2
# 4          0     1         9         a 3
# 5          0     1         9         a 4
# 6          0     1         9         a 5
# 7          0     1         9         a 6
# 8          0     1         9         a 7
# 9          0     1         9         a 8
# 10         0     1         9         a 9
# 11         0     2         2         b 0
# 12         0     2         2         b 1
# 13         0     2         2         b 2
# 14         0     3         3         c 0
# 15         0     3         3         c 1
# 16         0     3         3         c 2
# 17         0     3         3         c 3
# 18         4     2         9         d 4
# 19         4     2         9         d 5
# 20         4     2         9         d 6
# 21         4     2         9         d 7
# 22         4     2         9         d 8
# 23         4     2         9         d 9
# 24         4     3         9         e 4
# 25         4     3         9         e 5
# 26         4     3         9         e 6
# 27         4     3         9         e 7
# 28         4     3         9         e 8
# 29         4     3         9         e 9

# And then simple `left_join`

left_join(df, Stores_with_t_unnest)
# Joining by: c("t", "Store")
# t Store          var1 StartDate CloseDate storeVar1
# 1  1     1 0.26550866         0         9         a
# 2  1     2 0.37212390         0         2         b
# 3  1     3 0.57285336         0         3         c
# 4  2     1 0.90820779         0         9         a
# 5  2     2 0.20168193         0         2         b
# 6  2     3 0.89838968         0         3         c
# 7  3     1 0.94467527         0         9         a
# 8  3     3 0.66079779         0         3         c
# 9  4     1 0.62911404         0         9         a
# 10 4     2 0.06178627         4         9         d
# 11 4     3 0.20597457         4         9         e
