Left join by group and condition (`tidyverse` or `data.table`)
I have a very large data frame that includes integer columns state and state_cyclen. Every row is a gameframe: state describes the state the game is in at that frame, and state_cyclen is coded to indicate the nth occurrence of that state (it is basically data.table::rleid(state)). Conditioning on state and cycling by state_cyclen, I need to import several columns from other definition data frames. Definition data frames store properties about each state, and their row order determines how these properties are cycled throughout the game (players encounter each game state many times).
A minimal example of the long data that should be left joined:
data <- data.frame(
state = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
data
#> state state_cyclen
#> 1 1 1
#> 2 1 1
#> 3 2 1
#> 4 2 1
#> 5 3 1
#> 6 3 1
#> 7 1 2
#> 8 1 2
#> 9 2 2
#> 10 2 2
#> 11 3 2
#> 12 3 2
#> 13 2 3
#> 14 2 3
#> 15 3 3
#> 16 3 3
#> 17 3 3
#> 18 4 1
#> 19 4 1
#> 20 3 4
#> 21 3 4
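For readers wondering how a column like state_cyclen could be derived: in the example above it restarts for every state, so it behaves like a run counter computed per state rather than one global rleid. A minimal base R sketch, assuming that reading is right (the names runs and run_occurrence are illustrative, not from the original post):

```r
state <- c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3)

# Collapse consecutive identical states into runs
runs <- rle(state)

# Number each run within its own state value (1st run of state 3, 2nd run of state 3, ...)
run_occurrence <- ave(seq_along(runs$values), runs$values, FUN = seq_along)

# Expand the per-run counter back to one value per gameframe
state_cyclen <- rep(run_occurrence, runs$lengths)
```

With the example data this reproduces the state_cyclen column shown above.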
Minimal example for definition data frames storing the ordering:
def_one <- data.frame(
prop = letters[1:3],
others = LETTERS[1:3]
)
def_two <- data.frame(
prop = letters[4:10],
others = LETTERS[4:10]
)
def_three <- data.frame(
prop = letters[11:12],
others = LETTERS[11:12]
)
I have a solution written in base R that gives the desired output, but it's neither very readable nor, probably, very efficient.
# Add empty columns
data$prop <- NA
data$others <- NA
# Function that recycles a numeric vector bounded by an upper limit
bounded_vec_recyc <- function(vec, n) if(n == 1) vec else (vec - 1) %% n + 1
# My solution
vec_pos_one <- data[data[, "state"] == 1, ]$state_cyclen
vec_pos_one <- bounded_vec_recyc(vec_pos_one, n = nrow(def_one))
data[data[, "state"] == 1, ][, c("prop", "others")] <- def_one[vec_pos_one,]
vec_pos_two <- data[data[, "state"] == 2, ]$state_cyclen
vec_pos_two <- bounded_vec_recyc(vec_pos_two, n = nrow(def_two))
data[data[, "state"] == 2, ][, c("prop", "others")] <- def_two[vec_pos_two,]
vec_pos_three <- data[data[, "state"] == 3, ]$state_cyclen
vec_pos_three <- bounded_vec_recyc(vec_pos_three, n = nrow(def_three))
data[data[, "state"] == 3, ][, c("prop", "others")] <- def_three[vec_pos_three,]
data
#> state state_cyclen prop others
#> 1 1 1 a A
#> 2 1 1 a A
#> 3 2 1 d D
#> 4 2 1 d D
#> 5 3 1 k K
#> 6 3 1 k K
#> 7 1 2 b B
#> 8 1 2 b B
#> 9 2 2 e E
#> 10 2 2 e E
#> 11 3 2 l L
#> 12 3 2 l L
#> 13 2 3 f F
#> 14 2 3 f F
#> 15 3 3 k K
#> 16 3 3 k K
#> 17 3 3 k K
#> 18 4 1 <NA> <NA>
#> 19 4 1 <NA> <NA>
#> 20 3 4 l L
#> 21 3 4 l L
Created on 2022-08-30 with reprex v2.0.2
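The recycling step boils down to modular arithmetic on a 1-based index. The sketch below uses a hypothetical helper recycle_index() (not part of the original post) that applies the same formula without the n == 1 special case. Note that (vec - 1) %% 1 + 1 is always 1, so a one-row definition frame would be recycled too, whereas bounded_vec_recyc() as written returns vec unchanged in that case. Whether that difference matters depends on the intended behaviour for single-row definitions.

```r
# Hypothetical helper: map 1-based occurrence numbers onto 1..n, wrapping around
recycle_index <- function(vec, n) (vec - 1) %% n + 1

# A definition frame with 2 rows: occurrences 1, 2, 3, 4 map to rows 1, 2, 1, 2
idx_two_rows <- recycle_index(1:4, 2)

# A definition frame with 1 row: every occurrence maps to row 1
idx_one_row <- recycle_index(1:3, 1)
```

Because the formula is vectorised over both arguments, it can also be applied to a whole column with a per-row n.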
TLDR: As you can see, I am basically trying to merge these definition data frames one by one into the main data frame on the corresponding state, recycling the rows of each definition data frame while retaining their order, and using the state_cyclen column to keep track of occurrences of each state throughout the game.
Is there a way to do this within the tidyverse or data.table that is faster or at least easier to read? I need this to be quite fast as I have many such gameframe files (in the hundreds) and they are lengthy (hundreds of thousands of rows).
PS: Not sure if the title is adequate for the operations I am doing, as I can imagine multiple ways of implementation. Edits on it are welcome.
Here, I make a lookup table combining the three sources. Then I join the data with the number of rows for each state, adjust state_cyclen using modulo with that number so it falls within the lookup range (stored as state_cyclen_adj), then join.
library(tidyverse)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
mutate(state = as.numeric(state)) %>%
group_by(state) %>%
mutate(state_cyclen_adj = row_number()) %>%
ungroup()
data %>%
left_join(def %>% count(state)) %>%
# eg for row 15 we change 3 to 1 since the lookup table only has 2 rows
mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
left_join(def)
Joining, by = "state"
Joining, by = c("state", "state_cyclen_adj")
state state_cyclen n state_cyclen_adj prop others
1 1 1 3 1 a A
2 1 1 3 1 a A
3 2 1 7 1 d D
4 2 1 7 1 d D
5 3 1 2 1 k K
6 3 1 2 1 k K
7 1 2 3 2 b B
8 1 2 3 2 b B
9 2 2 7 2 e E
10 2 2 7 2 e E
11 3 2 2 2 l L
12 3 2 2 2 l L
13 2 3 7 3 f F
14 2 3 7 3 f F
15 3 3 2 1 k K
16 3 3 2 1 k K
17 3 3 2 1 k K
18 4 1 NA NA <NA> <NA>
19 4 1 NA NA <NA> <NA>
20 3 4 2 2 l L
21 3 4 2 2 l L
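If the helper columns n and state_cyclen_adj are not wanted in the result, and the join keys should be spelled out rather than guessed by dplyr, a variation along these lines might work. This is only a sketch of the same approach; the result name and the explicit by arguments are additions, not part of the answer above.

```r
library(dplyr)

data <- data.frame(
  state = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
def_one <- data.frame(prop = letters[1:3], others = LETTERS[1:3])
def_two <- data.frame(prop = letters[4:10], others = LETTERS[4:10])
def_three <- data.frame(prop = letters[11:12], others = LETTERS[11:12])

# Lookup table: one row per (state, position within that state's definition)
def <- bind_rows(def_one, def_two, def_three, .id = "state") %>%
  mutate(state = as.numeric(state)) %>%
  group_by(state) %>%
  mutate(state_cyclen_adj = row_number()) %>%
  ungroup()

result <- data %>%
  # attach the number of definition rows per state (NA for undefined states)
  left_join(count(def, state), by = "state") %>%
  # wrap the occurrence number into the range 1..n
  mutate(state_cyclen_adj = (state_cyclen - 1) %% n + 1) %>%
  left_join(def, by = c("state", "state_cyclen_adj")) %>%
  # drop the helper columns so only the imported properties remain
  select(-n, -state_cyclen_adj)
```

Naming the keys explicitly also silences the "Joining, by = ..." messages.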
Here is a data.table solution. Not sure it is easier to read, but pretty sure it is more efficient:
library(data.table)
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
data <- setDT(data)
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
dt[data,on = c("state","state_cyclen")]
prop others state state_cyclen
1: a A 1 1
2: a A 1 1
3: d D 2 1
4: d D 2 1
5: k K 3 1
6: k K 3 1
7: b B 1 2
8: b B 1 2
9: e E 2 2
10: e E 2 2
11: l L 3 2
12: l L 3 2
13: f F 2 3
14: f F 2 3
15: k K 3 1
16: k K 3 1
17: k K 3 1
18: <NA> <NA> 4 1
19: <NA> <NA> 4 1
20: l L 3 2
21: l L 3 2
prop others state state_cyclen
Step by step: I bind the def_one, def_two and def_three data frames to create a data.table with the variables you need to merge:
dt <- rbind(setDT(def_one)[,state := 1],
setDT(def_two)[,state := 2],
setDT(def_three)[,state := 3])
dt[,state_cyclen := 1:.N,by = state]
In case you want to merge a lot of data frames, you can use rbindlist and a list of data.tables.
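A minimal sketch of that rbindlist variant, assuming the definition frames sit in a list in state order (the list name defs is illustrative). With an unnamed list, idcol fills the new column with the integer position of each list element, which here doubles as the state:

```r
library(data.table)

def_one <- data.frame(prop = letters[1:3], others = LETTERS[1:3])
def_two <- data.frame(prop = letters[4:10], others = LETTERS[4:10])
def_three <- data.frame(prop = letters[11:12], others = LETTERS[11:12])

# Assumption: list position i holds the definitions for state i
defs <- list(def_one, def_two, def_three)

# Stack all definitions; "state" records which list element each row came from
dt <- rbindlist(defs, idcol = "state")

# Position of each row within its own state's definition
dt[, state_cyclen := seq_len(.N), by = state]
```

Note that state comes out as an integer here rather than the numeric 1, 2, 3 used above.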
I then modify your state_cyclen in data to do the same recycling as you do:
dt[,.N,by = state]
state N
1: 1 3
2: 2 7
3: 3 2
gives the lengths you use to define your recycling.
data[dt[,.N,by = state],
state_cyclen := bounded_vec_recyc(state_cyclen,i.N),
on = "state",
by = .EACHI]
I use by = .EACHI to modify the variable for each group during the merge, using the N variable from dt[,.N,by = state].
Then I just have to do the left join:
dt[data,on = c("state","state_cyclen")]
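Note that the steps above overwrite state_cyclen in data with the recycled value. If the original counter should be kept, one possible variation is to store the recycled index in a temporary column and add the properties by reference with update joins. This is only a sketch under that assumption; the column name idx and the table name sizes are illustrative. Because the modulo formula is vectorised, by = .EACHI is not needed here.

```r
library(data.table)

data <- data.table(
  state = c(1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3, 2, 2, 3, 3, 3, 4, 4, 3, 3),
  state_cyclen = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 1, 1, 4, 4)
)
def_one <- data.frame(prop = letters[1:3], others = LETTERS[1:3])
def_two <- data.frame(prop = letters[4:10], others = LETTERS[4:10])
def_three <- data.frame(prop = letters[11:12], others = LETTERS[11:12])

# Lookup table with the position of each row inside its state's definition
dt <- rbindlist(list(def_one, def_two, def_three), idcol = "state")
dt[, idx := seq_len(.N), by = state]

# Number of definition rows per state
sizes <- dt[, .N, by = state]

# Update join 1: recycled index in a helper column, state_cyclen stays untouched
data[sizes, on = "state", idx := (state_cyclen - 1) %% i.N + 1]

# Update join 2: pull the properties in by reference, keeping row order
data[dt, on = c("state", "idx"), `:=`(prop = i.prop, others = i.others)]

# Drop the helper column
data[, idx := NULL]
```

Rows for states without a definition (state 4 in the example) keep NA in prop and others.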
An option with nest/unnest:
library(dplyr)
library(tidyr)
data %>%
nest_by(state) %>%
left_join(tibble(state = 1:3, dat = list(def_one, def_two, def_three))) %>%
mutate(data = list(bind_cols(data, if(!is.null(dat))
dat[data %>%
pull(state_cyclen) %>%
bounded_vec_recyc(., nrow(dat)),] else NULL)), dat = NULL) %>%
ungroup %>%
unnest(data)
-output
# A tibble: 21 × 4
state state_cyclen prop others
<dbl> <dbl> <chr> <chr>
1 1 1 a A
2 1 1 a A
3 1 2 b B
4 1 2 b B
5 2 1 d D
6 2 1 d D
7 2 2 e E
8 2 2 e E
9 2 3 f F
10 2 3 f F
# … with 11 more rows