简体   繁体   中英

vectorially finding neighbors to data frame rows

I have two data frames, below is a small sample of each:

df1 <- data.frame(a1= c(3,4), a2 = c(8, 8), a3 = c(4, 18), a4 = c(9,9), a5 = c(17, 30))

df2 <- data.frame(a1 = c(2,2,2,3,3,3,4,4,4), a2 = c(7,7,7,7,7,7,7,7,7), 
                 a3 = c(4,4,4,4,4,4,4,4,4), a4 = c(10,10,10, 10, 10, 10, 10,10,10), 
                 a5 = c(15,16,17, 15, 16, 17, 15, 16, 17))

I would like to examine, for each row of df1 , whether it has "neighbors" in df2 , where, by neighbors I mean observations that are different by at most 1 in each column (in absolute value). So for example, row 2 of df2 is a neighbor of row 1 in df1 .

The way I currently do this is the following:

sweep(as.matrix(df2), 2, as.matrix(df1[1,]), "-")

For row 1 of df1 , and I have to repeat this for each row of df1. Note that df2 and df1 do not have the same number of rows.

However, what I would really like is to avoid doing this "by row", because my data frames have many rows. Is there a way to do it vectorially?

You can use split your row of df1 into a list, and then use lapply to achieve Vectorization:

my_list=lapply(as.list(data.frame(t(df1))),function(x) sweep(as.matrix(df2), 2, as.matrix(x), "-"))

each element of my_list is the computation result of each row in df1

my_list[[1]]
      a1 a2 a3 a4 a5
 [1,] -1 -1  0  1 -2
 [2,] -1 -1  0  1 -1
 [3,] -1 -1  0  1  0
 [4,]  0 -1  0  1 -2
 [5,]  0 -1  0  1 -1
 [6,]  0 -1  0  1  0
 [7,]  1 -1  0  1 -2
 [8,]  1 -1  0  1 -1
 [9,]  1 -1  0  1  0

Also, you can use parallel::mclapply which is faster than traditional lapply

Here is a possible data.table approach using non-equi joins

library(data.table)
cols <- names(df2)

#convert into data.table and add row index for clarity
setDT(df1)[, rn1 := .I]
setDT(df2)[, rn2 := .I]

#create a lower (-1) and upper (+1) bound on each column
bandsNames <- paste0(rep(cols, each=2L), "_", rep(c("lower", "upper"), length(cols)))
df2Bands <- df2[, 
    {
        ans <- do.call(cbind, lapply(.SD, function(x) outer(x, c(-1L, 1L), `+`)))
        setnames(data.table(ans), bandsNames)
    }, by=.(rn2)]

#create the non-equi join conditions
lowerLimits <- paste0(cols, "_lower<=", cols)
upperLimits <- paste0(cols, "_upper>=", cols)

#perform the non-equi join on lower and upper limits and return the count
#`:=` add a new column in df1 by reference
df1[, Count := 
        df2Bands[df1, .N, by=.EACHI, on=c(lowerLimits, upperLimits)]$N
    ]

desired output:

   a1 a2 a3 a4 a5 rn1 Count
1:  3  8  4  9 17   1     6
2:  4  8 18  9 30   2     0

If you want to find the matching rows as well:

df2Bands[df1, .(rn1=i.rn1, rn2=x.rn2), by=.EACHI, on=c(lowerLimits, upperLimits)][, 
    -(1L:length(bandsNames))]

Matched rows:

   rn1 rn2
1:   1   2
2:   1   3
3:   1   5
4:   1   6
5:   1   8
6:   1   9
7:   2  NA

I do not think there is a good way to fully vectorise this problem, (apply family are really just for loops in a bow tie). But you can do it on a by column basis, rather than by row. If further improvement is required the size of the problem can be reduced after each column by removing rows that can be excluded from ever matching (this will cause an indexing headache, but is relatively do-able).

My attempt is below which uses a for loop (which could be replaced by lapply). It returns a truth matrix, rows with a 1 can be matched to columns with a 1, which gives the pairing of neighbours.

col_comp = function(x,y)
{
    lx = length(x)
    ly = length(y)
    return(abs(rep(x,ly) - rep(y,each = lx) )<=1)
}

full_comp=function(df1,df2)
{
    rows1 = seq_len(nrow(df1))
    rows2 = seq_len(nrow(df2))
    M = matrix(1L, nrow=nrow(df1),ncol=nrow(df2))
    for(i in seq_len(ncol(df1)) )
    {
        matches = col_comp(df1[rows1,i],df2[rows2,i])

        M = M*matches
    }
    return(M)
}

A solution using library(sqldf) :

library(sqldf)

sqldf( "select df2.*, df1.rowid as df1_idx
        from df2 left join df1
           on df2.a1 between df1.a1-1 and df1.a1+1
          and df2.a2 between df1.a2-1 and df1.a2+1
          and df2.a3 between df1.a3-1 and df1.a3+1
          and df2.a4 between df1.a4-1 and df1.a4+1
          and df2.a5 between df1.a5-1 and df1.a5+1")

  a1 a2 a3 a4 a5 df1_idx
1  2  7  4 10 15     NA
2  2  7  4 10 16      1
3  2  7  4 10 17      1
4  3  7  4 10 15     NA
5  3  7  4 10 16      1
6  3  7  4 10 17      1
7  4  7  4 10 15     NA
8  4  7  4 10 16      1
9  4  7  4 10 17      1

Edit to show solution for any number of columns:

library(sqldf)

cnames <- colnames(df1)

# main body of your sql
sql_main <- "select df2.*, df1.rowid as df1_idx
            from df2 left join df1
            on 1=1"

# join conditions (which will be added to above)
join_conditions <- 
  paste0( ' and df2.', cnames, ' BETWEEN df1.', cnames, '-1',
                                   ' AND df1.', cnames, '+1',
          collapse = '')

sql <- paste(sql_main, join_conditions)

sqldf(sql)

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM