
R Alternatives to a for loop for searching through a large dataset

The goal is to count, for each entry in column b, how many entries in column a fall within a range of +/-1 (or as required). A simplified version is provided:

a <- c("1231210","1231211", "1231212", "98798", "98797", "98796", "555125", "555127","555128")
b <- c("1", "2", "3", "4", "5", "6", "1231209", "98797", "555126")
df <- data.frame(a, b)

I merged this data into a data frame to simulate my actual dataset, converted the columns to numeric, and wrote the following function to get my desired output. (Note: column a need not be part of the df; it could be a separate vector, I suppose?)

df$c <- mapply(
  function(x) {
    count <- 0
    for (i in df$a) {
      if (abs(i - x) <= 1) {
        count <- count + 1
      }
    }
    paste0(count)
  },
  df$b
)
        a       b c
1 1231210       1 0
2 1231211       2 0
3 1231212       3 0
4   98798       4 0
5   98797       5 0
6   98796       6 0
7  555125 1231209 1
8  555127   98797 3
9  555128  555126 2

While this appears to work fine for the trial dataset, my actual dataset has over 2 million rows, which means roughly 2M^2 iterations (it was still running after 3 hours). I was wondering if there is an alternative strategy to tackle this, preferably using base R functions only.

I'm quite new to R, and a common suggestion is to use vectorization to improve efficiency. However, looking at the examples provided on the net, I have no clue whether that is possible in this case.

Would love to hear any suggestions and feel free to point out mistakes. Thanks!

Why are vectors a and b characters? They should be numeric:

a <- c(1231210,1231211, 1231212, 98798, 98797, 98796, 555125, 555127,555128)
b <- c(1, 2, 3, 4, 5, 6, 1231209, 98797, 555126)

You can simplify by using only one loop and vectorization:

unlist(lapply(b, function(x) sum(abs(a-x) <= limit)))

where limit is a variable describing the allowed difference. For limit <- 1 you get:

 [1] 0 0 0 0 0 0 1 3 2
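Applied to the example data, the result can be assigned straight back to the data frame (a minimal sketch; note sum() already returns a numeric count, so the paste0() from the original function is unnecessary):

```r
a <- c(1231210, 1231211, 1231212, 98798, 98797, 98796, 555125, 555127, 555128)
b <- c(1, 2, 3, 4, 5, 6, 1231209, 98797, 555126)
limit <- 1

df <- data.frame(a, b)
# for each b, count how many values of a lie within +/- limit
df$c <- unlist(lapply(b, function(x) sum(abs(a - x) <= limit)))
df$c
# [1] 0 0 0 0 0 0 1 3 2
```

This replaces only the inner loop with a vectorized comparison; the outer pass over b is still a loop, so it remains O(n*m), just with a much smaller constant.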

What about colSums + outer?

transform(
  type.convert(data.frame(a, b), as.is = TRUE),
  C = colSums(abs(outer(a, b, `-`)) <= 1)
)

output

        a       b C
1 1231210       1 0
2 1231211       2 0
3 1231212       3 0
4   98798       4 0
5   98797       5 0
6   98796       6 0
7  555125 1231209 1
8  555127   98797 3
9  555128  555126 2

As your data is quite large, the outer and lapply approaches will be quite slow (for outer you would need 14901.2 Gb of RAM). I suggest using data.table:
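The RAM figure comes from the n x n intermediate matrix that outer() materialises. A rough back-of-the-envelope check, assuming 2 million rows and 4-byte integers:

```r
n <- 2e6            # rows in the real dataset
bytes <- n * n * 4  # outer() builds an n x n integer matrix (4 bytes per cell)
bytes / 2^30        # size in GiB, roughly 14901.2
```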

require(data.table)
dt <- as.data.table(df)

dt[, id := 1:.N] # add an id, in case there are duplicated values
setkey(dt, id)
dt[, b1 := b - 1L]
dt[, b2 := b + 1L]
x <- dt[dt, on = .(a >= b1, a <= b2)] # non-equi join
x <- x[, .(c = sum(!is.na(b1))), keyby = .(id = i.id)]
dt[x, c := i.c, on = 'id']
dt
#          a       b id      b1      b2 c
# 1: 1231210       1  1       0       2 0
# 2: 1231211       2  2       1       3 0
# 3: 1231212       3  3       2       4 0
# 4:   98798       4  4       3       5 0
# 5:   98797       5  5       4       6 0
# 6:   98796       6  6       5       7 0
# 7:  555125 1231209  7 1231208 1231210 1
# 8:  555127   98797  8   98796   98798 3
# 9:  555128  555126  9  555125  555127 2

dt[, id := NULL][, b1 := NULL][, b2 := NULL] # remove helper columns

PS: make sure a and b are converted to integers beforehand.
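If you would rather stay in base R, the same O(n log n) idea behind the non-equi join can be sketched with sort() plus findInterval(), which does a binary search per lookup. This is my own suggestion, not from the posts above, and the `- 1L` trick assumes integer data:

```r
a <- c(1231210L, 1231211L, 1231212L, 98798L, 98797L, 98796L,
       555125L, 555127L, 555128L)
b <- c(1L, 2L, 3L, 4L, 5L, 6L, 1231209L, 98797L, 555126L)
limit <- 1L

s <- sort(a)  # sort once; each findInterval lookup is then a binary search
# count of s <= b + limit, minus count of s <= b - limit - 1,
# i.e. how many values of a lie in [b - limit, b + limit]
cnt <- findInterval(b + limit, s) - findInterval(b - limit - 1L, s)
cnt
# [1] 0 0 0 0 0 0 1 3 2
```

Unlike outer(), this never builds an n x n matrix, so memory stays O(n).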
