Merging data.tables by numeric column when machine tolerance needs to be accounted for

Many have seen the issue with using == to compare floating point numbers: == fails to return TRUE where all.equal succeeds.

x <- sqrt(2)
x^2 == 2
#> [1] FALSE
all.equal(x^2, 2)
#> [1] TRUE

My issue comes from the need to join two data.tables by a numeric column where == will fail to find the matching pairs.

I have considered coercing the numeric values to character, but that option introduces too many other potential errors. I have also considered rounding the values, but in my application that too would create more problems.

Here is a simple example of a join that fails because DT1$x == DT2$x returns FALSE where TRUE would be preferable.

library(data.table)
packageVersion("data.table")
#> [1] '1.12.8'

DT1 <- data.table(x = sqrt(1:10), v1 = 1:10)
DT2 <- data.table(x = 1:10, v2 = LETTERS[1:10])

# set x to its square
DT1[, x := x^2]

# left join
merge(DT1, DT2, by = "x", all.x = TRUE)
#>      x v1   v2
#>  1:  1  1    A
#>  2:  2  2 <NA>
#>  3:  3  3 <NA>
#>  4:  4  4    D
#>  5:  5  5 <NA>
#>  6:  6  6 <NA>
#>  7:  7  7 <NA>
#>  8:  8  8 <NA>
#>  9:  9  9    I
#> 10: 10 10 <NA>

How can I specify a left join by a numeric key column such that the machine tolerance in the comparison is accounted for?

Created on 2020-04-06 by the reprex package (v0.3.0)

You could use roll = "nearest". Note that only the last column specified in on = can be rolling.

library(data.table)
DT1[DT2, on = "x", roll = "nearest"]
    x v1 v2
 1:  1  1  A
 2:  2  2  B
 3:  3  3  C
 4:  4  4  D
 5:  5  5  E
 6:  6  6  F
 7:  7  7  G
 8:  8  8  H
 9:  9  9  I
10: 10 10  J
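
To illustrate the point about only the last on = column rolling, here is a minimal sketch with two join columns. The tables A and B and the grouping column g are hypothetical and not part of the question; g is matched exactly, while x, listed last, is matched to the nearest value.

```r
library(data.table)

# Hypothetical tables: g is an exact-match key, x is the imprecise numeric key
A <- data.table(g = c("a", "a", "b"), x = sqrt(c(2, 3, 2))^2, v1 = 1:3)
B <- data.table(g = c("a", "b"), x = c(2, 2), v2 = c("P", "Q"))

# g is joined with exact equality; x comes last in on =, so it is the rolling column
res_multi <- A[B, on = c("g", "x"), roll = "nearest"]
res_multi
```

If the imprecise numeric column is not already the last join column, it has to be moved to the end of on = for this approach to apply.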

I suspect the real problem is more complicated than this simple case, but you could subsequently filter out matches whose difference exceeds a certain threshold.
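
One way such a filter might look is sketched below. It is not the only option and makes several assumptions: the helper column x_orig, the threshold tol, and the extra DT2 row at x = 5.4 are introduced purely for illustration, and DT2$x is stored as double here to keep both key columns the same type. The threshold has the same magnitude as the default tolerance of all.equal, but it is applied as an absolute difference.

```r
library(data.table)

DT1 <- data.table(x = sqrt(1:10)^2, v1 = 1:10)
# The extra row at x = 5.4 is an assumption, added to show a match being rejected
DT2 <- data.table(x = c(as.numeric(1:10), 5.4), v2 = LETTERS[1:11])

# Keep a copy of DT1's key: after the join, column x holds DT2's values
DT1[, x_orig := x]

res <- DT1[DT2, on = "x", roll = "nearest"]

# Assumed threshold; choose one that suits the scale of your data
tol <- sqrt(.Machine$double.eps)

# Blank out matches whose keys differ by more than the threshold
res[abs(x - x_orig) > tol, v1 := NA_integer_]
res[, x_orig := NULL]
res
```

Rows whose nearest neighbour is only off by floating point error keep their v1 value, while the row at x = 5.4 is treated as unmatched.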

Data

DT1 <- data.table(x = sqrt(1:10), v1 = 1:10)
DT2 <- data.table(x = 1:10, v2 = LETTERS[1:10])
DT1[, x := x^2]
