简体   繁体   中英

Is there a more memory-efficient way to use combn to subtract every column from every other column in R?

I am trying to subtract each column from each other column in a large R data.table, that has 13125 columns and 90 rows.

I am following up on a previous question which addresses this for data.tables of smaller dimensions ( Subtract every column from each other column in a R data.table ).

My problem is that I am currently running out of memory to generate the resulting data.table of column combinations (which seems to require 59.0GB).

My question is, is there a more memory-efficient way to calculate the column differences with combn or perhaps another function for larger datasets?

The code I have been using is:

# I have a data.table of 13125 columns and 90 rows, called data. 

# use combn to generate all possible pairwise column combinations (column + column),
# then within this apply a function to subtract the column value from its paired column value.
# this is done for each row, to produce a new datatable called res.

res <- as.data.table(combn(colnames(data), 2, function(x) data[[x[1]]] - data[[x[2]]]))

# take the pairwise column combinations and paste the pairing as the new column name

colnames(res) <- combn(colnames(data), 2, paste, collapse="_")

I apologise if this question is too similar and therefore considered a duplication. I would be very grateful for any advice with how to improve the efficiency of this code for the scale of my data.

As per OP's comment regarding the next step after differencing columns, it will be more memory compact if you also square and sum the column totals during the calculation so that you will only have a vector with 13,125 elements as a result rather than storing 13,125*90*90 numeric subtracted values. A fast and possible approach is to use RcppArmadillo :

colpairs.cpp (by no means the only implementation):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

// [[Rcpp::export]]
rowvec colpairs(mat Z) {
    unsigned int i, j, k = Z.n_cols;
    colvec vi, vj, y;
    rowvec res(k);

    for (i=0; i<k; i++) {
        vi = Z.col(i);
        res[i] = 0;
        for (j=0; j<k; j++) {
            vj = Z.col(j);
            y = vi - vj;
            res[i] += as_scalar(y.t() * y);
        }
    }

    return res;
}

In R:

library(Rcpp)
library(RcppArmadillo)
sourceCpp("colpairs.cpp")

# #use a small matrix to check results
# set.seed(0L)
# nc <- 3; nr <- 3; M <- matrix(rnorm(nr*nc), ncol=nc)
# c(sum((M[,1]-M[,2])^2 + (M[,1]-M[,3])^2), sum((M[,3]-M[,2])^2 + (M[,2]-M[,3])^2), sum((M[,3]-M[,1])^2 + (M[,2]-M[,3])^2))
# colpairs(M)

set.seed(0L)
nc <- 13125
nr <- 90
M <- matrix(rnorm(nr*nc), ncol=nc)
colpairs(M)

trunc. output:

[1] 2105845 2303591 2480945 2052415 2743199 2475948 2195874 2122436 2317515  .....

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM