What is the fastest way of creating an identifier for multi-row groups with data.table in R?
I have a dataframe that identifies a set of values with an id:
library(data.table)
dt <- data.table(
id = rep(c("a", "b", "c"), each = 2),
value1 = c(1, 1, 1, 2, 1, 1),
value2 = c(0, 3, 0, 3, 0, 3)
)
dt
#> id value1 value2
#> 1: a 1 0
#> 2: a 1 3
#> 3: b 1 0
#> 4: b 2 3
#> 5: c 1 0
#> 6: c 1 3
As you can see, the ids a and c both identify the same set of values. So I want to create a "pattern id" that identifies the set of values associated with the ids a and c (note: an id might identify more than two rows; I limited them to two rows here for the sake of simplicity).
I did manage to come up with a solution using nested data.tables and match():
dt <- dt[, .(data = list(.SD)), by = id]
unique_groups <- unique(dt$data)
dt[, pattern_id := match(data, unique_groups)]
dt[, data := NULL]
dt
#> id pattern_id
#> 1: a 1
#> 2: b 2
#> 3: c 1
It does the trick, but it is not as fast as I'd like it to be.
The match() documentation is pretty clear regarding its efficiency with lists:
Matching for lists is potentially very slow and best avoided except in simple cases.
As you can see, I don't need the actual pattern data in my final result, only a table that associates the ids with the pattern ids. Nesting the data, using it to match and then dropping it afterwards feels a bit wasteful, but I'm not sure there's a better way. I was thinking of something that transforms each dataframe into a string or, even better, avoids the nesting altogether, but I couldn't come up with anything better than what I have now.
I have created a bigger dataset to play around with and test different solutions:
set.seed(0)
size <- 1000000
dt <- data.table(
id = rep(1:(size / 2), each = 2),
value1 = sample(1:10, size, replace = TRUE),
value2 = sample(1:10, size, replace = TRUE)
)
We can try the code below:
dt[
,
q := toString(unlist(.SD)), id
][
,
pattern_id := .GRP, q
][
,
q := NULL
][]
or
dt[
,
q := toString(unlist(.SD)),
id
][
,
pattern_id := as.integer(factor(match(q, q)))
][
,
q := NULL
][]
which gives (on the small example data):
id value1 value2 pattern_id
1: a 1 0 1
2: a 1 3 1
3: b 1 0 2
4: b 2 3 2
5: c 1 0 1
6: c 1 3 1
Updated (to remove join):
This one replicates your approach (i.e., it requires the order to be the same, as well as the values):
unique(
dt[, pattern:=.(paste0(c(value1,value2), collapse=",")), by=id][,.(id,pattern)]
)[,grp:=.GRP, by=pattern][,pattern:=NULL]
id grp
<char> <int>
1: a 1
2: b 2
3: c 1
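To make the order requirement concrete, here is a minimal sketch on hypothetical toy data (the table `toy` and the column `key` are illustrative names, not part of the answer above). It assumes that sorting the rows within each id is an acceptable way to make the key order-insensitive.

```r
library(data.table)

# Hypothetical toy data: ids "a" and "b" hold the same rows in a different order.
toy <- data.table(
  id     = rep(c("a", "b"), each = 2),
  value1 = c(1, 2, 2, 1),
  value2 = c(0, 3, 3, 0)
)

# Order-sensitive key: rows are pasted in the order they appear.
ordered_key <- toy[, .(key = paste(value1, value2, sep = "|", collapse = ";")),
                   by = id]

# Order-insensitive key: sort the rows within each id before pasting.
sorted_key <- toy[order(id, value1, value2),
                  .(key = paste(value1, value2, sep = "|", collapse = ";")),
                  by = id]

# Number the distinct keys in order of first appearance.
ordered_key[, pattern_id := .GRP, by = key]
sorted_key[, pattern_id := .GRP, by = key]
```

With the order-sensitive key the two ids get different pattern ids; after sorting they share one.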
Prior solution (this one sorts the values, so it ignores their order; %>% comes from magrittr):
library(magrittr)
dt[dt[, .(paste0(sort(c(value1,value2)), collapse=",")), by=id] %>%
.[,pattern:=.GRP, by=V1] %>%
.[,V1:=NULL], on=.(id)]
Output:
id value1 value2 pattern
<char> <num> <num> <int>
1: a 1 0 1
2: a 1 3 1
3: b 1 0 2
4: b 2 3 2
5: c 1 0 1
6: c 1 3 1
With toString, as suggested by the data.table error message when using a list as by:
Column or expression 1 of 'by' is type 'list' which is not currently supported.
As a workaround, consider converting the column to a supported type, e.g. by=sapply(list_col, toString)
dt <- dt[, .(data = list(.SD)), by = id]
dt[, pattern_id :=.GRP, by = sapply(data, toString)]
dt[,unlist(data,recursive=F),by=.(id,pattern_id)]
id pattern_id value1 value2
<char> <int> <num> <num>
1: a 1 1 0
2: a 1 1 3
3: b 2 1 0
4: b 2 2 3
5: c 1 1 0
6: c 1 1 3
However, this is slower than match().
Assuming each id is repeated exactly twice, "reshape" the data: convert each 2x2 block into a single row of 4 columns. Then get the group ID using .GRP, grouping by all columns except id:
res <- dt[, c(.SD[ 1 ], .SD[ 2 ]), by = id]
setnames(res, make.unique(colnames(res)))
res[, pattern_id := .GRP, by = res[, -1] ][, .(id, pattern_id)]
# id pattern_id
# 1: 1 1
# 2: 2 2
# 3: 3 3
# 4: 4 4
# 5: 5 5
# ---
# 499996: 499996 1010
# 499997: 499997 3175
# 499998: 499998 3996
# 499999: 499999 3653
# 500000: 500000 4217
On the bigger dataset this takes about half a second.
Edit: another version using dcast, but it is 8x slower:
res <- dcast(dt, id ~ value1 + value2, length)
res[, pattern_id :=.GRP, by = res[, -1] ][, .(id, pattern_id)]
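The first reshape above assumes exactly two rows per id. A possible way to lift that assumption, sketched here on hypothetical toy data, is to cast on rowid(id) so that each row position gets its own columns. This is an untimed illustration rather than a benchmarked solution, and like the original approach it is sensitive to row order.

```r
library(data.table)

# Hypothetical toy data: id "d" has three rows, the others have two.
dt <- data.table(
  id     = c("a", "a", "b", "b", "c", "c", "d", "d", "d"),
  value1 = c(1, 1, 1, 2, 1, 1, 1, 1, 5),
  value2 = c(0, 3, 0, 3, 0, 3, 0, 3, 5)
)

# rowid(id) numbers the rows within each id, so every row position gets its
# own set of columns; shorter groups are padded with NA.
wide <- dcast(dt, id ~ rowid(id), value.var = c("value1", "value2"))

# Group by every column except id to number the distinct patterns.
value_cols <- setdiff(names(wide), "id")
wide[, pattern_id := .GRP, by = value_cols]
res <- wide[, .(id, pattern_id)]
```

Here ids a and c share a pattern id, while b and d each get their own.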
Here are some benchmarks of the solutions that don't rely on each id identifying exactly two rows; the results are below.
library(data.table)
set.seed(0)
size <- 500000
dt <- data.table(
id = rep(1:(size / 2), each = 2),
value1 = sample(1:10, size, replace = TRUE),
value2 = sample(1:10, size, replace = TRUE)
)
my_solution <- function(x) {
x <- x[, .(data = list(.SD)), by = id]
unique_groups <- unique(x$data)
x[, pattern_id := match(data, unique_groups)]
x[, data := NULL]
x[]
}
langtang_solution <- function(x) {
x <- x[, .(data = paste0(value1, "|", value2, collapse = ";")), by = id]
x[, pattern_id := .GRP, by = data]
x[, data := NULL]
x[]
}
thomasiscoding_solution <- function(x) {
x <- x[, .(data = toString(unlist(.SD))), by = id]
x[, pattern_id := .GRP, by = data]
x[, data := NULL]
x[]
}
identical(my_solution(dt), langtang_solution(dt))
#> [1] TRUE
identical(my_solution(dt), thomasiscoding_solution(dt))
#> [1] TRUE
microbenchmark::microbenchmark(
my_solution(dt),
langtang_solution(dt),
thomasiscoding_solution(dt),
times = 50L
)
#> Unit: seconds
#> expr min lq mean median uq
#> my_solution(dt) 3.174106 3.566495 3.818829 3.793850 4.015176
#> langtang_solution(dt) 1.369860 1.467013 1.596558 1.529327 1.649607
#> thomasiscoding_solution(dt) 3.014511 3.154224 3.280713 3.256732 3.370015
#> max neval
#> 4.525275 50
#> 2.279064 50
#> 3.681657 50
This was very enriching. I didn't know about .GRP, which in my tests performs very similarly to match(), although a (very small) bit better. The best answer seems to be using paste() to convert each group into a string and then finding the group based on that string.
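As a rough sketch of that paste() plus .GRP idea for any number of value columns and any number of rows per id: the helper name `add_pattern_id` and the column `pattern_key` are hypothetical, the id column is assumed to be called `id`, and the values are assumed never to contain the separators "|" or ";".

```r
library(data.table)

# Hypothetical helper: returns one row per id with its pattern id.
add_pattern_id <- function(x) {
  # One string per row ("v1|v2|..."), collapsed into one string per id
  # ("row1;row2;..."). .SD holds every column except the grouping column id.
  keys <- x[, .(pattern_key = paste(do.call(paste, c(.SD, sep = "|")),
                                    collapse = ";")),
            by = id]
  # Number the distinct strings in order of first appearance.
  keys[, pattern_id := .GRP, by = pattern_key]
  keys[, pattern_key := NULL]
  keys[]
}

# Toy data: id "d" has three rows, the others have two.
dt <- data.table(
  id     = c("a", "a", "b", "b", "c", "c", "d", "d", "d"),
  value1 = c(1, 1, 1, 2, 1, 1, 1, 1, 5),
  value2 = c(0, 3, 0, 3, 0, 3, 0, 3, 5)
)
res <- add_pattern_id(dt)
```

The explicit separators keep row and column boundaries visible in the string, so that different groups are not mapped to the same key by accident.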
How about reshaping wider and using paste0()?
library(dplyr)
library(tidyr)
dt <- dt %>% group_by(id) %>%
mutate(inst = row_number(id)) %>%
pivot_wider(values_from = c(value1, value2),
names_from = inst) %>%
mutate(pattern_id = paste0(value1_1, value1_2, value2_1, value2_2))
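One caveat with paste0(): without a separator, different value combinations can produce the same string once values have more than one digit. A small illustration (the variable names are only for this example):

```r
# Without a separator, (1, 11) and (11, 1) become the same string.
no_sep_a <- paste0(1, 11)
no_sep_b <- paste0(11, 1)

# With an explicit separator the two combinations stay distinct.
sep_a <- paste(1, 11, sep = "|")
sep_b <- paste(11, 1, sep = "|")

no_sep_a == no_sep_b  # TRUE: both are "111"
sep_a == sep_b        # FALSE: "1|11" vs "11|1"
```

Using paste(..., sep = "|") in place of paste0() in the mutate() call would avoid this, assuming the values themselves never contain "|".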