R: compare two data.tables by row
I have two data.tables that I want to compare, but I don't know why I get a warning.
library(data.table)
DT1 <- data.table(ID    = c("F","A","E","B","C","D","C"),
                  num   = c(59,3,108,11,22,54,241),
                  value = c(90,47,189,38,42,86,280),
                  Mark  = c("Mary","Tom","Abner","Norman","Joanne",
                            "Bonnie","Trista"))
DT2 <- data.table(Mark   = c("Mary","Abner","Bonnie","Trista","Norman"),
                  numA   = c(48,20,88,237,20),
                  numB   = c(60,326,54,268,89),
                  valueA = c(78,34,78,270,59),
                  valueB = c(90,190,90,385,75))
# This filter is what produces the warning shown below
DToutput <- DT1[(num > DT2$numA & num < DT2$numB &
                 value > DT2$valueA & value < DT2$valueB)]
My goal: I want to check num and value in DT1 against the ranges in DT2, matched by Mark. DT2 holds a range from numA to numB for num (and from valueA to valueB for value).
For example, row F in DT1 has num = 59, value = 90, and Mark = "Mary". So when matching on Mark = "Mary", the row must also satisfy:
num(59) > DT2$numA(48) & num(59) < DT2$numB(60) & value(90) > DT2$valueA(78) & value(90) < DT2$valueB(90)
You can see that 90 < 90 does not hold, so my result will not include row F.
I got this warning:
Warning messages:
1: In num > DT2$numA : longer object length is not a multiple of shorter object length
2: In num < DT2$numB : longer object length is not a multiple of shorter object length
3: In value > DT2$valueA : longer object length is not a multiple of shorter object length
4: In value < DT2$valueB : longer object length is not a multiple of shorter object length
How can I modify it to do what I want?
Thank you.
Added: DT2 may contain several rows with the same Mark but different ranges. Does this affect the comparison?
Another option using a non-equi inner join:
DT2[DT1, on=.(Mark=Mark, numA<num, numB>num, valueA<value, valueB>value), nomatch=0L,
.(ID, num, value, Mark)]
or:
DT1[DT2, on=.(Mark, num>numA, num<numB, value>valueA, value<valueB), nomatch=0L,
.(ID, num=x.num, value=x.value, Mark)]
Output:
   ID num value   Mark
1:  E 108   189  Abner
2:  C 241   280 Trista
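For context on the warning in the question: `num` has 7 elements while `DT2$numA` has 5, so R recycles the shorter vector and compares rows that have nothing to do with each other. The sketch below only illustrates that recycling, using the two columns from the question as plain vectors. The variable name `res` is just for illustration.

```r
num  <- c(59, 3, 108, 11, 22, 54, 241)   # DT1$num, length 7
numA <- c(48, 20, 88, 237, 20)           # DT2$numA, length 5

# R recycles numA to length 7 (48, 20, 88, 237, 20, 48, 20) and warns,
# because 7 is not a multiple of 5. The warning is suppressed here only
# so the sketch runs quietly.
res <- suppressWarnings(num > numA)

# Position 2 compares Tom's num (3) with Abner's numA (20),
# i.e. two rows that do not share a Mark.
print(res)
```

Joining on Mark first, as in the answer above, avoids this because each DT1 row is compared only with DT2 rows that have the same Mark.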
Is this roughly what you are looking for? I joined the data.tables on Mark and filtered with between for your conditions. If this is not what you want, can you post a data.table of your expected output?
library(data.table)
# Join on Mark, then keep rows strictly inside both ranges
DT1[DT2, on = "Mark"][between(num, numA, numB, incbounds = FALSE) & between(value, valueA, valueB, incbounds = FALSE)]
   ID num value   Mark numA numB valueA valueB
1:  E 108   189  Abner   20  326     34    190
2:  C 241   280 Trista  237  268    270    385
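On the question's follow-up about several rows in DT2 sharing a Mark: a minimal sketch, assuming one extra "Mary" row with made-up ranges (50 to 70 and 80 to 100, not from the original data), suggests the join-then-filter approach simply tests each DT1 row against every range for its Mark and keeps it if any range matches.

```r
library(data.table)

DT1 <- data.table(ID    = c("F","A","E","B","C","D","C"),
                  num   = c(59,3,108,11,22,54,241),
                  value = c(90,47,189,38,42,86,280),
                  Mark  = c("Mary","Tom","Abner","Norman","Joanne","Bonnie","Trista"))

# DT2 from the question, plus one HYPOTHETICAL second "Mary" row (last entry)
DT2 <- data.table(Mark   = c("Mary","Abner","Bonnie","Trista","Norman","Mary"),
                  numA   = c(48,20,88,237,20,50),
                  numB   = c(60,326,54,268,89,70),
                  valueA = c(78,34,78,270,59,80),
                  valueB = c(90,190,90,385,75,100))

# Each DT2 row is joined to the DT1 rows with the same Mark,
# then rows outside either open interval are dropped.
res <- DT1[DT2, on = "Mark"][
  between(num, numA, numB, incbounds = FALSE) &
    between(value, valueA, valueB, incbounds = FALSE)]

print(res)
```

With this made-up second range, row F passes (59 lies between 50 and 70, 90 between 80 and 100) even though it still fails the original "Mary" range. If a row satisfied two ranges for the same Mark it would appear twice, so `unique()` may be needed depending on the goal.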
EDIT: A benchmark comparing this approach with the non-equi inner join from @Chinsoon12 shows that the non-equi inner join is much faster even with only slightly more data. It is not a perfect benchmark (I just repeated each data.table 1,000 times), but I still think it is clear that the non-equi inner join is much more efficient.
Unit: milliseconds
           expr      min       lq      mean    median       uq      max neval
        between 233.6378 265.4323 303.14039 301.82455 334.3225 373.2760    10
 non_equi_inner  71.6925  74.1547  96.96584  91.14375  97.6664 179.9907    10
Benchmark code:
# Repeat each table 1,000 times; sapply() returns a character matrix,
# so the numeric columns are converted back to integer.
DT1 <- data.table(sapply(DT1, rep, 1e3))[
  , c("num", "value") := lapply(.SD, as.integer), .SDcols = c("num", "value")]
DT2 <- data.table(sapply(DT2, rep, 1e3))[
  , c("numA", "numB", "valueA", "valueB") := lapply(.SD, as.integer),
  .SDcols = c("numA", "numB", "valueA", "valueB")]
microbenchmark::microbenchmark(
  between = {
    DT1[DT2, on = "Mark", allow.cartesian = TRUE][
      between(num, numA, numB, incbounds = FALSE) &
        between(value, valueA, valueB, incbounds = FALSE)]
  },
  non_equi_inner = {
    DT1[DT2, on = .(Mark, num > numA, num < numB, value > valueA, value < valueB),
        nomatch = 0L, .(ID, num = x.num, value = x.value, Mark),
        allow.cartesian = TRUE]
  },
  times = 10
)
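A timing comparison is only meaningful if both expressions select the same rows. The following is a small sanity check on the original (unrepeated) data from the question, written as a sketch. The names `res_between` and `res_join` are introduced here for illustration only.

```r
library(data.table)

DT1 <- data.table(ID    = c("F","A","E","B","C","D","C"),
                  num   = c(59,3,108,11,22,54,241),
                  value = c(90,47,189,38,42,86,280),
                  Mark  = c("Mary","Tom","Abner","Norman","Joanne","Bonnie","Trista"))
DT2 <- data.table(Mark   = c("Mary","Abner","Bonnie","Trista","Norman"),
                  numA   = c(48,20,88,237,20),
                  numB   = c(60,326,54,268,89),
                  valueA = c(78,34,78,270,59),
                  valueB = c(90,190,90,385,75))

# Approach 1: equi join on Mark, then filter with between()
res_between <- DT1[DT2, on = "Mark"][
  between(num, numA, numB, incbounds = FALSE) &
    between(value, valueA, valueB, incbounds = FALSE),
  .(ID, num, value, Mark)]

# Approach 2: non-equi inner join; x.num / x.value pull DT1's own values
res_join <- DT1[DT2,
                on = .(Mark, num > numA, num < numB, value > valueA, value < valueB),
                nomatch = 0L,
                .(ID, num = x.num, value = x.value, Mark)]

# Compare the selected rows from both approaches
print(all.equal(res_between, res_join))
```

If the two results ever differ on your own data, check the bounds first: both versions here use strict inequalities, matching `incbounds = FALSE`.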