I'm trying to join two data frames (data.tables), x and y, on x$val >= y$start & x$val <= y$end. I can't use dplyr because the only way to do inequality joins in dplyr is to join the tables then filter on the inequality, and the tables I want to join have 315k and 84k records. That would blow out memory.
data.table does have inequality joins, but I cannot for the life of me figure out how the syntax works. See this result:
x <- data.table(val = c(1:5), id = "a")
y <- data.table(start = c(1:5), end = c(11:15), id= "a")
x[y, on=c("val>=start","val<=end"),
.(start, val, end)]
start val end
1: 1 1 11
2: 1 1 11
3: 1 1 11
4: 1 1 11
5: 1 1 11
6: 2 2 12
7: 2 2 12
8: 2 2 12
9: 2 2 12
10: 3 3 13
11: 3 3 13
12: 3 3 13
13: 4 4 14
14: 4 4 14
15: 5 5 15
To show what I would expect to get, here is what dplyr produces:
x <- data.table(val = c(1:5), id = "a")
y <- data.table(start = c(1:5), end = c(11:15), id= "a")
x %>%
inner_join(y) %>%
filter(val >= start & val <= end)
val id start end
1 1 a 1 11
2 2 a 1 11
3 2 a 2 12
4 3 a 1 11
5 3 a 2 12
6 3 a 3 13
7 4 a 1 11
8 4 a 2 12
9 4 a 3 13
10 4 a 4 14
11 5 a 1 11
12 5 a 2 12
13 5 a 3 13
14 5 a 4 14
15 5 a 5 15
Can anyone explain what it is I'm missing with the data.table syntax?
You need to select val
column from the left table with prefix ( x.
); For more information, see this answer :
x[y, on=.(val >= start, val <= end), .(val = x.val, id, start, end)][order(val)]
# ^^
# val id start end
#1: 1 a 1 11
#2: 2 a 1 11
#3: 2 a 2 12
#4: 3 a 1 11
#5: 3 a 2 12
#6: 3 a 3 13
#7: 4 a 1 11
#8: 4 a 2 12
#9: 4 a 3 13
#10: 4 a 4 14
#11: 5 a 1 11
#12: 5 a 2 12
#13: 5 a 3 13
#14: 5 a 4 14
#15: 5 a 5 15
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.