Efficient way to store lists within a data frame

I need to compute pairwise intersections of close to 40k lists. Specifically, I want to know whether I can store the vector id in column 1 and a list of its values in column 2. I should then be able to process column 2, i.e. find the overlap/intersection between any two rows.

column 1  column 2
idA       1,2,5,9,10
idB       5,9,25
idC       2,25,67

I want to be able to get the pairwise intersection values. This should also work if the values in column 2 are not already sorted.

What is the best data structure to use if I go ahead with R? My data originally looks like this:

column1 1 2 3 9 10 25 67 5
idA     1 1 0 1  1  0  0 1
idB     0 0 0 1  0  1  0 1
idC     0 1 0 0  0  1  1 0

Edited for clarity, as per the suggestions below.

I'd keep the data in a logical matrix:

DF <- read.table(text = "column1 1 2 3 9 10 25 67 5
idA     1 1 0 1  1  0  0 1
idB     0 0 0 1  0  1  0 1
idC     0 1 0 0  0  1  1 0", header = TRUE, check.names = FALSE)

#turn into logical matrix
m <- as.matrix(DF[-1])
rownames(m) <- DF[[1]]
mode(m) <- "logical"

#if you can, create your data as a sparse matrix to save memory
#if you already have a dense data matrix, keep it that way
library(Matrix)
M <- as(m, "lMatrix")

#calculate intersections 
#does each comparison twice
intersections <- simplify2array(
  lapply(seq_len(nrow(M)), function(x) 
    lapply(seq_len(nrow(M)), function(x, y) colnames(M)[M[x, ] & M[y, ]], x = x)
  )
)

This double loop could be optimized. I'd do it in Rcpp and create a long-format data.frame instead of a list matrix. I'd also do each comparison only once (e.g., only the upper triangle).

colnames(intersections) <- rownames(intersections) <- rownames(M)
#    idA         idB         idC        
#idA Character,5 Character,2 "2"        
#idB Character,2 Character,3 "25"       
#idC "2"         "25"        Character,3

intersections["idA", "idB"]
#[[1]]
#[1] "9" "5"
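
The suggestion above to compare each pair only once and return a long-format data.frame can be sketched in plain R. This is only an illustration under some assumptions: `pairwise_intersections` is a hypothetical helper name, not part of the original answer, and it expects a base logical matrix with row and column names. It enumerates all pairs with `combn`, so it shows the output layout rather than something that would scale to 40k rows; for that size the Rcpp route mentioned above is probably still needed.

```r
# Hypothetical helper (not from the original answer): computes the
# intersection for each unordered pair of rows exactly once and
# returns the result in long format.
pairwise_intersections <- function(m) {
  stopifnot(is.logical(m), nrow(m) >= 2,
            !is.null(rownames(m)), !is.null(colnames(m)))
  # one column per unordered pair of row indices (upper triangle only)
  pairs <- utils::combn(nrow(m), 2)
  shared <- lapply(seq_len(ncol(pairs)), function(k) {
    colnames(m)[m[pairs[1, k], ] & m[pairs[2, k], ]]
  })
  out <- data.frame(
    id1 = rownames(m)[pairs[1, ]],
    id2 = rownames(m)[pairs[2, ]],
    n_shared = lengths(shared),
    stringsAsFactors = FALSE
  )
  out$shared <- shared  # list column holding the shared values
  out
}

# Same example data as in the question, as a base logical matrix
m <- matrix(
  c(1, 1, 0, 1, 1, 0, 0, 1,
    0, 0, 0, 1, 0, 1, 0, 1,
    0, 1, 0, 0, 0, 1, 1, 0) == 1,
  nrow = 3, byrow = TRUE,
  dimnames = list(c("idA", "idB", "idC"),
                  c("1", "2", "3", "9", "10", "25", "67", "5"))
)

res <- pairwise_intersections(m)
res$shared[[1]]  # shared values for the pair idA / idB
```

Each row of `res` describes one pair, with the shared values kept in the list column `shared`, so the result can be filtered or sorted by `n_shared` like any other data.frame.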
