R: alternatives to a for loop for searching a large dataset

Question

The goal here is to count, for each entry in column b, how many entries in column a match it within a range of +/-1 (or another tolerance, as required). A simplified version is provided:

a <- c("1231210","1231211", "1231212", "98798", "98797", "98796", "555125", "555127","555128")
b <- c("1", "2", "3", "4", "5", "6", "1231209", "98797", "555126")
df <- data.frame(a, b)

I merged this data into a data frame to simulate my actual dataset, converted the columns to numeric, and wrote the following function to get my desired output. (Note: column a need not be part of the df, but could be a separate list, I suppose?)

df$c <- mapply(
function(x){
    count = 0
    for (i in df$a){
        if (abs(i-x) <= 1){
            count = count +1
        }
    }
    paste0(count)
},
df$b
)

        a       b c
1 1231210       1 0
2 1231211       2 0
3 1231212       3 0
4   98798       4 0
5   98797       5 0
6   98796       6 0
7  555125 1231209 1
8  555127   98797 3
9  555128  555126 2

While this appears to work fine for the trial dataset, my actual dataset has over 2 million rows, which means roughly (2M)^2 iterations? (It is still running after 3 hours.) I was wondering if there is an alternative strategy to tackle this, preferably using base R functions only.

I'm quite new to R, and a common suggestion is to use vectorization to improve efficiency. However, looking at the examples available on the net, I have no clue whether that is possible in this case.

Would love to hear any suggestions, and feel free to point out mistakes. Thanks!

Answer 1

Why are vectors a and b characters? They should be numeric:

a <- c(1231210,1231211, 1231212, 98798, 98797, 98796, 555125, 555127,555128)
b <- c(1, 2, 3, 4, 5, 6, 1231209, 98797, 555126)

You can simplify by using only one loop and vectorization:

unlist(lapply(b, function(x) sum(abs(a-x) <= limit)))

where limit is a variable describing the allowed difference. For limit <- 1 you get:

 [1] 0 0 0 0 0 0 1 3 2
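The one-liner above still compares every element of b against all of a. If that remains too slow at 2 million rows, one possible base R alternative is to sort a once and count matches with findInterval. This is a different technique from the one in the answer, sketched here under the assumption that a and b are numeric and that the helper name count_within is purely illustrative:

```r
# Sample data from the answer above.
a <- c(1231210, 1231211, 1231212, 98798, 98797, 98796, 555125, 555127, 555128)
b <- c(1, 2, 3, 4, 5, 6, 1231209, 98797, 555126)

# Hypothetical helper (not from the answer): count, for each element of b,
# how many elements of a lie in [b - limit, b + limit].
count_within <- function(a, b, limit = 1) {
  sa <- sort(a)  # sort once; sort() drops NA values by default
  # Number of elements of a that are <= b + limit.
  upper <- findInterval(b + limit, sa)
  # Number of elements of a that are strictly < b - limit.
  lower <- findInterval(b - limit, sa, left.open = TRUE)
  upper - lower
}

counts <- count_within(a, b, limit = 1)

# Cross-check against the lapply one-liner on the small sample.
limit <- 1
reference <- unlist(lapply(b, function(x) sum(abs(a - x) <= limit)))
stopifnot(all(counts == reference))
```

Sorting costs one pass up front, after which each lookup is a binary search rather than a full scan of a, so the work no longer grows with the product of the two lengths.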

Answer 2

What about colSums + outer?

transform(
  type.convert(data.frame(a, b), as.is = TRUE),
  C = colSums(abs(outer(a, b, `-`)) <= 1)
)

Output

        a       b C
1 1231210       1 0
2 1231211       2 0
3 1231212       3 0
4   98798       4 0
5   98797       5 0
6   98796       6 0
7  555125 1231209 1
8  555127   98797 3
9  555128  555126 2
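Note that outer materialises a full length(a) by length(b) matrix, so its memory use grows with the product of the two lengths. A back-of-the-envelope estimate, assuming 2 million entries on each side and 4-byte integer cells (both figures are assumptions for illustration; intermediate results such as the logical comparison matrix would need additional memory):

```r
# Rough size of the matrix that outer(a, b, `-`) would allocate.
n_rows <- 2e6          # assumed length of a and b ("over 2 million rows")
bytes_per_cell <- 4    # R stores integers in 4 bytes (doubles take 8)
gib <- n_rows^2 * bytes_per_cell / 2^30
print(round(gib, 1))   # 14901.2
```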

Answer 3

As your data is quite large, the outer and lapply approaches will be quite slow (for outer you would need 14901.2 GB of RAM, which corresponds to a 2 million by 2 million matrix at 4 bytes per integer). I suggest using data.table:

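library(data.table)
dt <- as.data.table(df) # assumes a and b are already integer, not character

dt[, id := 1:.N] # add a row id, since values may be duplicated
setkey(dt, id)
dt[, b1 := b - 1L] # lower bound of the allowed range
dt[, b2 := b + 1L] # upper bound of the allowed range
x <- dt[dt, on = .(a >= b1, a <= b2)] # non-equi join: rows of a within [b1, b2]
x <- x[, .(c = sum(!is.na(b1))), keyby = .(id = i.id)] # count matches per row of b
dt[x, c := i.c, on = 'id'] # write the counts back by id
dt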
#          a       b id      b1      b2 c
# 1: 1231210       1  1       0       2 0
# 2: 1231211       2  2       1       3 0
# 3: 1231212       3  3       2       4 0
# 4:   98798       4  4       3       5 0
# 5:   98797       5  5       4       6 0
# 6:   98796       6  6       5       7 0
# 7:  555125 1231209  7 1231208 1231210 1
# 8:  555127   98797  8   98796   98798 3
# 9:  555128  555126  9  555125  555127 2

dt[, id := NULL][, b1 := NULL][, b2 := NULL] # remove helper columns

P.S. Check that a and b are converted to integers beforehand.
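For the character data in the question, the conversion could look like the following. This is a minimal sketch, not part of the answer; it rebuilds the sample data frame and converts every column in place:

```r
# Sample data from the question, stored as character strings.
a <- c("1231210", "1231211", "1231212", "98798", "98797", "98796",
       "555125", "555127", "555128")
b <- c("1", "2", "3", "4", "5", "6", "1231209", "98797", "555126")
df <- data.frame(a, b)

# Convert every column to integer. Going through as.character() also handles
# factor columns, which older R versions create by default in data.frame().
df[] <- lapply(df, function(col) as.integer(as.character(col)))

str(df)
stopifnot(is.integer(df$a), is.integer(df$b))
```

Values beyond the 32-bit integer range would become NA with a warning, so for larger identifiers as.numeric may be the safer choice.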

3 answers

Answer 1
1 2021-08-17 08:49:37

Answer 2
1 2021-08-17 08:56:10

Answer 3
1 Accepted 2021-08-17 09:24:28
