Left join by group and condition (`tidyverse` or `data.table`)

I have a very large data frame that includes the integer columns state and state_cyclen. Every row is a gameframe; state describes the state the game is in at that frame, and state_cyclen indicates the nth occurrence of that state (it is basically data.table::rleid(state), counted separately for each state). Conditioning on state and cycling by state_cyclen, I need to import several columns from other definition data frames. The definition data frames store properties of each state, and their row order determines how these properties are cycled throughout the game (players encounter each game state many times).

A minimal example of the long data that should be left joined:

data <- data.frame(
  state        = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)

data 
#>    state state_cyclen
#> 1      1            1
#> 2      1            1
#> 3      2            1
#> 4      2            1
#> 5      3            1
#> 6      3            1
#> 7      1            2
#> 8      1            2
#> 9      2            2
#> 10     2            2
#> 11     3            2
#> 12     3            2
#> 13     2            3
#> 14     2            3
#> 15     3            3
#> 16     3            3
#> 17     3            3
#> 18     4            1
#> 19     4            1
#> 20     3            4
#> 21     3            4
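
For readers wondering how a column like state_cyclen relates to a run-length ID, here is a minimal base R sketch. It assumes state_cyclen is meant to count the runs of each state in order of appearance; the names runs and run_id are only illustrative and do not come from the original post.

```r
state <- c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3)

# Run id over the whole vector (the same idea as data.table::rleid(state))
runs   <- rle(state)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Within each state, number its runs in order of appearance
state_cyclen <- ave(run_id, state, FUN = function(r) match(r, unique(r)))

state_cyclen
#>  [1] 1 1 1 1 1 1 2 2 2 2 2 2 3 3 3 3 3 1 1 4 4
```

The plain run id alone keeps increasing across states, so the per-state renumbering is what makes the value usable as a cycling index.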

A minimal example of the definition data frames, whose row order defines the cycling order:

def_one <- data.frame(
  prop = letters[1:3],
  others = LETTERS[1:3]
)  

def_two <- data.frame(
  prop = letters[4:10],
  others = LETTERS[4:10]
) 

def_three <- data.frame(
  prop = letters[11:12],
  others = LETTERS[11:12]
) 

I have a solution written in base R that gives the desired output, but it is not very readable and probably not very efficient.

# Add empty columns
data$prop <- NA
data$others <- NA

# Function that recycles a numeric index vector into the range 1..n
# (the modulo form already handles n == 1 by mapping every index to 1)
bounded_vec_recyc <- function(vec, n) (vec - 1) %% n + 1

# My solution
vec_pos_one <- data[data[, "state"] == 1, ]$state_cyclen 
vec_pos_one <- bounded_vec_recyc(vec_pos_one, n = nrow(def_one))
data[data[, "state"] == 1, ][, c("prop", "others")] <- def_one[vec_pos_one,]
  

vec_pos_two <- data[data[, "state"] == 2, ]$state_cyclen 
vec_pos_two <- bounded_vec_recyc(vec_pos_two, n = nrow(def_two))
data[data[, "state"] == 2, ][, c("prop", "others")] <- def_two[vec_pos_two,]


vec_pos_three <- data[data[, "state"] == 3, ]$state_cyclen 
vec_pos_three <- bounded_vec_recyc(vec_pos_three, n = nrow(def_three))
data[data[, "state"] == 3, ][, c("prop", "others")] <- def_three[vec_pos_three,]

data
#>    state state_cyclen prop others
#> 1      1            1    a      A
#> 2      1            1    a      A
#> 3      2            1    d      D
#> 4      2            1    d      D
#> 5      3            1    k      K
#> 6      3            1    k      K
#> 7      1            2    b      B
#> 8      1            2    b      B
#> 9      2            2    e      E
#> 10     2            2    e      E
#> 11     3            2    l      L
#> 12     3            2    l      L
#> 13     2            3    f      F
#> 14     2            3    f      F
#> 15     3            3    k      K
#> 16     3            3    k      K
#> 17     3            3    k      K
#> 18     4            1 <NA>   <NA>
#> 19     4            1 <NA>   <NA>
#> 20     3            4    l      L
#> 21     3            4    l      L

Created on 2022-08-30 with reprex v2.0.2
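
The three near-identical blocks above can be folded into one loop. This is only a sketch of the same base R approach, assuming the definition data frames are collected in a list named by state; the list defs and the loop variables are illustrative names, not part of the original code.

```r
data <- data.frame(
  state        = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
def_one   <- data.frame(prop = letters[1:3],   others = LETTERS[1:3])
def_two   <- data.frame(prop = letters[4:10],  others = LETTERS[4:10])
def_three <- data.frame(prop = letters[11:12], others = LETTERS[11:12])

# Hypothetical container: definitions keyed by the state they describe
defs <- list(`1` = def_one, `2` = def_two, `3` = def_three)

data$prop   <- NA_character_
data$others <- NA_character_

for (s in names(defs)) {
  def  <- defs[[s]]
  rows <- which(data$state == as.numeric(s))
  if (length(rows) == 0) next
  # Wrap the occurrence counter into 1..nrow(def)
  idx <- (data$state_cyclen[rows] - 1) %% nrow(def) + 1
  data[rows, c("prop", "others")] <- def[idx, c("prop", "others")]
}
```

States without a definition (state 4 here) simply keep NA in the new columns.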

TL;DR: As you can see, I am basically trying to merge these definition data frames one by one into the main data frame on the corresponding state, recycling the rows of each definition data frame while retaining their order, and using the state_cyclen column to keep track of the occurrences of each state throughout the game.
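
The recycling itself is plain modular arithmetic on a 1-based index. A small worked example, using a hypothetical helper name recycle_index, shows why the index is shifted by one before and after taking the modulo:

```r
# Map a 1-based occurrence counter onto rows 1..n of a definition table
recycle_index <- function(i, n) (i - 1) %% n + 1

recycle_index(1:5, 2)  # 1 2 1 2 1
recycle_index(1:5, 3)  # 1 2 3 1 2
recycle_index(1:3, 1)  # 1 1 1

# Without the shift, multiples of n would map to 0, which is not a valid row
(1:5) %% 2             # 1 0 1 0 1
```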

Is there a way to do this within the tidyverse or data.table that is faster or at least easier to read? I need this to be quite fast as I have many such gameframe files (in the hundreds) and they are lengthy (hundreds of thousands of rows).

PS: I am not sure the title is adequate for the operations I am doing, as I can imagine multiple ways of implementing this. Edits to it are welcome.

Here, I make a lookup table combining the three sources. Then I join the data with the number of rows for each state, compute an adjusted index state_cyclen_adj using modulo with that number so that it falls within the lookup range, and join again on that index.

library(tidyverse)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
  mutate(state = as.numeric(state))  %>%
  group_by(state) %>%
  mutate(state_cyclen_adj = row_number()) %>%
  ungroup()

data %>%
  left_join(def %>% count(state)) %>%
  # eg for row 15 we change 3 to 1 since the lookup table only has 2 rows
  mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
  left_join(def)


Joining, by = "state"
Joining, by = c("state", "state_cyclen_adj")
   state state_cyclen  n state_cyclen_adj prop others
1      1            1  3                1    a      A
2      1            1  3                1    a      A
3      2            1  7                1    d      D
4      2            1  7                1    d      D
5      3            1  2                1    k      K
6      3            1  2                1    k      K
7      1            2  3                2    b      B
8      1            2  3                2    b      B
9      2            2  7                2    e      E
10     2            2  7                2    e      E
11     3            2  2                2    l      L
12     3            2  2                2    l      L
13     2            3  7                3    f      F
14     2            3  7                3    f      F
15     3            3  2                1    k      K
16     3            3  2                1    k      K
17     3            3  2                1    k      K
18     4            1 NA               NA <NA>   <NA>
19     4            1 NA               NA <NA>   <NA>
20     3            4  2                2    l      L
21     3            4  2                2    l      L
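
If the helper columns and the join messages are not wanted, the same pipeline can be written with explicit by arguments and a final select(). This is a sketch of that variant, not a different method; the names sizes and result are illustrative.

```r
library(dplyr)

data <- data.frame(
  state        = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
def_one   <- data.frame(prop = letters[1:3],   others = LETTERS[1:3])
def_two   <- data.frame(prop = letters[4:10],  others = LETTERS[4:10])
def_three <- data.frame(prop = letters[11:12], others = LETTERS[11:12])

# Lookup table: one row per (state, position within that state's definition)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
  mutate(state = as.numeric(state)) %>%
  group_by(state) %>%
  mutate(state_cyclen_adj = row_number()) %>%
  ungroup()

# Number of definition rows per state (columns: state, n)
sizes <- count(def, state)

result <- data %>%
  left_join(sizes, by = "state") %>%
  mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
  left_join(def, by = c("state", "state_cyclen_adj")) %>%
  select(-n, -state_cyclen_adj)
```

Here result keeps the original row order and only gains the prop and others columns.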

Here is a data.table solution. I am not sure it is easier to read, but I am fairly sure it is more efficient:

library(data.table)

dt <- rbind(setDT(def_one)[,state := 1],
            setDT(def_two)[,state := 2],
            setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]

data <- setDT(data)
data[dt[,.N,by = state],
     state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
     on = "state",
     by = .EACHI]

dt[data,on = c("state","state_cyclen")]
    prop others state state_cyclen
 1:    a      A     1            1
 2:    a      A     1            1
 3:    d      D     2            1
 4:    d      D     2            1
 5:    k      K     3            1
 6:    k      K     3            1
 7:    b      B     1            2
 8:    b      B     1            2
 9:    e      E     2            2
10:    e      E     2            2
11:    l      L     3            2
12:    l      L     3            2
13:    f      F     2            3
14:    f      F     2            3
15:    k      K     3            1
16:    k      K     3            1
17:    k      K     3            1
18: <NA>   <NA>     4            1
19: <NA>   <NA>     4            1
20:    l      L     3            2
21:    l      L     3            2
    prop others state state_cyclen

Step by step: I bind the def_one, def_two and def_three data frames to create a single data.table with the variables you need to merge:

dt <- rbind(setDT(def_one)[,state := 1],
            setDT(def_two)[,state := 2],
            setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]

In case you want to merge a lot of data frames, you can use rbindlist and a list of data.tables.
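
A minimal sketch of that rbindlist variant follows. It assumes the list is ordered so that its position equals the state number, because idcol fills the state column with the list position when the list is unnamed; the name defs is illustrative.

```r
library(data.table)

def_one   <- data.frame(prop = letters[1:3],   others = LETTERS[1:3])
def_two   <- data.frame(prop = letters[4:10],  others = LETTERS[4:10])
def_three <- data.frame(prop = letters[11:12], others = LETTERS[11:12])

# Assumption: position in the list corresponds to the state number
defs <- list(def_one, def_two, def_three)

# idcol adds a "state" column holding each table's position in the list
dt <- rbindlist(defs, idcol = "state")

# Position of each row within its own state, used as the join key
dt[, state_cyclen := seq_len(.N), by = state]
```

Unlike the setDT() calls above, this leaves def_one, def_two and def_three themselves unmodified.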

I then modify state_cyclen in data to apply the same recycling as you do:

dt[,.N,by = state]

   state N
1:     1 3
2:     2 7
3:     3 2

gives the lengths you use to define your recycling.给出您用来定义回收的长度。

data[dt[,.N,by = state],
     state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
     on = "state",
     by = .EACHI]

I use by = .EACHI to modify the variable for each group during the merge, using the N column from dt[, .N, by = state]. Note that this overwrites state_cyclen in data by reference.

Then I just have to do the left join:

dt[data,on = c("state","state_cyclen")]
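
If the original state_cyclen values must be preserved, one possible variant keeps the wrapped index in a temporary column and adds the new columns with an update join. This is a sketch under that assumption; the names sizes, n_def and idx are helpers invented for this illustration.

```r
library(data.table)

data <- data.table(
  state        = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
def_one   <- data.frame(prop = letters[1:3],   others = LETTERS[1:3])
def_two   <- data.frame(prop = letters[4:10],  others = LETTERS[4:10])
def_three <- data.frame(prop = letters[11:12], others = LETTERS[11:12])

# Lookup table: one row per (state, idx)
dt <- rbindlist(list(def_one, def_two, def_three), idcol = "state")
dt[, state := as.numeric(state)]        # match the type of data$state
dt[, idx := seq_len(.N), by = state]

# Number of definition rows per state, copied into data by an update join
sizes <- dt[, .(n_def = .N), by = state]
data[sizes, n_def := i.n_def, on = "state"]

# Wrapped index in its own column; state_cyclen itself is left untouched
data[, idx := as.integer((state_cyclen - 1) %% n_def + 1)]

# Update join: add prop and others to data by reference, keeping row order
data[dt, c("prop", "others") := .(i.prop, i.others), on = c("state", "idx")]
data[, c("n_def", "idx") := NULL]
```

Rows whose state has no definition (state 4) get no match and keep NA in prop and others.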

An option with nest/unnest:

library(dplyr)
library(tidyr)
data %>% 
  nest_by(state) %>%
  left_join(tibble(state = 1:3, dat = list(def_one, def_two, def_three))) %>% 
  mutate(data = list(bind_cols(data, if(!is.null(dat))
    dat[data %>%
    pull(state_cyclen) %>%
    bounded_vec_recyc(., nrow(dat)),] else NULL)), dat = NULL) %>% 
  ungroup %>% 
  unnest(data)

Output:

# A tibble: 21 × 4
   state state_cyclen prop  others
   <dbl>        <dbl> <chr> <chr> 
 1     1            1 a     A     
 2     1            1 a     A     
 3     1            2 b     B     
 4     1            2 b     B     
 5     2            1 d     D     
 6     2            1 d     D     
 7     2            2 e     E     
 8     2            2 e     E     
 9     2            3 f     F     
10     2            3 f     F     
# … with 11 more rows
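
Note that the output above comes back sorted by state rather than in the original frame order. If the original order matters, one possible workaround is to carry a row id through the grouped step and sort on it afterwards. The sketch below swaps nest/unnest for dplyr's group_modify() to keep it short; defs, row_id and result are illustrative names, and the recycling formula is the same as in the question.

```r
library(dplyr)

data <- data.frame(
  state        = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
# Hypothetical container: definitions keyed by the state they describe
defs <- list(
  `1` = data.frame(prop = letters[1:3],   others = LETTERS[1:3]),
  `2` = data.frame(prop = letters[4:10],  others = LETTERS[4:10]),
  `3` = data.frame(prop = letters[11:12], others = LETTERS[11:12])
)

result <- data %>%
  mutate(row_id = row_number()) %>%            # remember the frame order
  group_by(state) %>%
  group_modify(function(rows, key) {
    def <- defs[[as.character(key$state)]]
    if (is.null(def)) return(rows)             # no definition: leave as is
    idx <- (rows$state_cyclen - 1) %% nrow(def) + 1
    bind_cols(rows, def[idx, , drop = FALSE])
  }) %>%
  ungroup() %>%
  arrange(row_id) %>%                          # restore the frame order
  select(-row_id)
```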
