
Reading large flat file of x,y,z into table of row names x, column names y, and values z

I recently started using R, and I want to use it to convert a large file of conditional probabilities into a distance matrix based on the variation of information (see https://en.wikipedia.org/wiki/Variation_of_information and https://en.wikipedia.org/wiki/Mutual_information ). To do this, I need to read in a fairly large flat file (~35 GB) of conditional probabilities whose contents are:

     1      7979  1
     2     23243  0
     23243     1  0.343
     ......

And so on. What I want to do is read the data and reshape in such a way that I have a table (or matrix) that has:

        1  2  ... 7979 ... 23243 ...
 1      z  z   z   1   z    z ... 
 2      z  z   z   z   z    0 ...
...     z  z   z   z   z    z ...
7979    z  z   z   z   z    z ...
...     z  z   z   z   z    z ...
23243  0.343 0   z   z   z    z ...

where the z's are the values from the third column of the flat file. Some things to consider:

1) most of the values in the third column of the flat file are 0.

2) The resulting table is square, with each row being about 50,000 entries.

3) Once I have the table loaded, each row must be summed multiple times, once for all elements, and (#rows-1)^2 times with one column being left out in each additional summation.

Any ideas would be great. The only thought I have had so far is to remove all of the lines from the flat file that have the third column equal to zero in a preprocessing step (awk does this just fine) and then try to find a package to create a sparse matrix from the flat file and convert that to a dense matrix on the fly within R for the computation, but I haven't had much luck (I know dummy.matrix does something like this but I am not sure how to use it).

Sample data

Creating a data frame with only non-zero z values (suppose we can remove all of the zero lines from the flat file before importing data).

N <- 50000
S <- N * 0.8 
df_input <- data.frame( x = sample(1:N, S), y = sample(1:N, S), z = runif(S))

# > head(df_input)
#       x     y         z
# 1 35093 13107 0.6078230
# 2 46104  5201 0.1596800
# 3 21262  1943 0.9006491
# 4 10250 21508 0.6725270
# 5 41243 33452 0.7160704
# 6 17123 45607 0.5535252

Creating a matrix

With the Matrix package we can represent sparse matrices:

# create sparse matrix
library(Matrix)
M1 <- sparseMatrix(i = df_input$x, j = df_input$y, x = df_input$z, dims = c(N, N))

# > dim(M1)
# [1] 50000 50000
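
If the triplets come from the preprocessed flat file rather than from a simulated data frame, the same sparseMatrix call applies. The following is only a minimal sketch: the temporary file, its three sample lines, and the names path, triplets, n and M_file are made up for illustration, and it assumes the file is whitespace-separated with the zero rows already removed. For a file of many gigabytes a faster reader (for example fread from the data.table package) would probably be a better choice than read.table.

```r
library(Matrix)

# Hypothetical stand-in for the preprocessed flat file (zero rows removed).
path <- tempfile(fileext = ".txt")
writeLines(c("1 3 1", "4 1 0.343", "2 4 0.5"), path)

# Read the three whitespace-separated columns with explicit types.
triplets <- read.table(path, col.names = c("x", "y", "z"),
                       colClasses = c("integer", "integer", "numeric"))

# Assumption: the matrix is square and ids run from 1 to the largest id seen.
n <- max(triplets$x, triplets$y)

# Build the sparse matrix directly from the (row, column, value) triplets.
M_file <- sparseMatrix(i = triplets$x, j = triplets$y, x = triplets$z,
                       dims = c(n, n))
```

Note that sparseMatrix adds up the values of repeated (i, j) pairs, so the input should contain each pair at most once.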

Calculate sums

With smaller matrices we would normally do something like this:

# sum rows with i-th column excluded 
# warning: the result is a dense N x (N + 1) matrix and must fit in memory!
result <- sapply(1:(N + 1), FUN = function(i) {
  rowSums(M1[,-i])
})

But it might not be possible to hold an N x (N+1) matrix in memory: for N = 50,000 that is roughly 20 GB of doubles. M1 is sparse, but the resulting N x (N+1) matrix is dense because it is full of sum values. Now what?

Well, it depends on how the sums will be used. You can always compute the row sums with one column excluded directly from the sparse matrix M1:

rsums <- function(M1, col_num) rowSums(M1[,-col_num])

The row sums without the i-th column:

rsums(M1, i)

The sum of the j-th row without the i-th column:

rsums(M1, i)[j]
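
Since leaving out column i only removes one entry from each row sum, the same values can also be obtained by computing the full row sums once and subtracting column i. This avoids subsetting the matrix for every excluded column. The following is a sketch on a small made-up matrix, not the original data; the names M, total, rsums_fast and ok are hypothetical.

```r
library(Matrix)

# Small reproducible example standing in for the 50000 x 50000 matrix.
set.seed(1)
n <- 6
M <- sparseMatrix(i = c(1, 2, 3, 6, 6), j = c(2, 5, 3, 1, 2),
                  x = runif(5), dims = c(n, n))

# Full row sums, computed once.
total <- rowSums(M)

# Row sums with column col_num left out: subtract that column from the totals.
rsums_fast <- function(M, total, col_num) total - M[, col_num]

# Check against the subsetting approach for every column.
ok <- vapply(seq_len(n), function(i) {
  isTRUE(all.equal(rsums_fast(M, total, i), rowSums(M[, -i])))
}, logical(1))
all(ok)
```

With this approach a single entry, the sum of the j-th row without the i-th column, is total[j] - M[j, i], so the dense N x (N+1) result never has to be stored.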

