Is there a more memory-efficient way to use combn to subtract every column from every other column in R?

I am trying to subtract each column from every other column in a large R data.table that has 13125 columns and 90 rows.

I am following up on a previous question that addresses this for data.tables of smaller dimensions (Subtract every column from each other column in a R data.table).

My problem is that I run out of memory when generating the resulting data.table of column combinations (which seems to require 59.0GB).

My question is: is there a more memory-efficient way to calculate the column differences for larger datasets, with combn or perhaps another function?

The code I have been using is:

# I have a data.table of 13125 columns and 90 rows, called data. 

# use combn to generate all possible pairwise column combinations (column + column),
# then within this apply a function to subtract the column value from its paired column value.
# this is done for each row, to produce a new datatable called res.

res <- as.data.table(combn(colnames(data), 2, function(x) data[[x[1]]] - data[[x[2]]]))

# take the pairwise column combinations and paste the pairing as the new column name

colnames(res) <- combn(colnames(data), 2, paste, collapse="_")

I apologise if this question is too similar and is therefore considered a duplicate. I would be very grateful for any advice on how to improve the efficiency of this code at the scale of my data.

Per the OP's comment about the step that follows the column differencing, it is far more memory-compact to square the differences and sum them per column during the calculation itself. The result is then a single vector of 13,125 elements, rather than a stored table of choose(13125, 2) * 90 (about 7.75 billion) numeric differences. One fast way to do this is with RcppArmadillo:

colpairs.cpp (by no means the only implementation):

// [[Rcpp::depends(RcppArmadillo)]]
#include <RcppArmadillo.h>
using namespace Rcpp;
using namespace arma;

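// For each column i, accumulate over every column j the squared Euclidean
// distance between columns i and j, so only k totals are kept in memory.
// [[Rcpp::export]]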
rowvec colpairs(mat Z) {
    unsigned int i, j, k = Z.n_cols;
    colvec vi, vj, y;
    rowvec res(k);

    for (i=0; i<k; i++) {
        vi = Z.col(i);
        res[i] = 0;
        for (j=0; j<k; j++) {
            vj = Z.col(j);
            y = vi - vj;
            res[i] += as_scalar(y.t() * y);
        }
    }

    return res;
}

In R:

library(Rcpp)
library(RcppArmadillo)
sourceCpp("colpairs.cpp")

# #use a small matrix to check results
# set.seed(0L)
# nc <- 3; nr <- 3; M <- matrix(rnorm(nr*nc), ncol=nc)
# c(sum((M[,1]-M[,2])^2 + (M[,1]-M[,3])^2), sum((M[,2]-M[,1])^2 + (M[,2]-M[,3])^2), sum((M[,3]-M[,1])^2 + (M[,3]-M[,2])^2))
# colpairs(M)

set.seed(0L)
nc <- 13125
nr <- 90
M <- matrix(rnorm(nr*nc), ncol=nc)
colpairs(M)

Truncated output:

[1] 2105845 2303591 2480945 2052415 2743199 2475948 2195874 2122436 2317515  .....
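
Since only the per-column totals are kept, the double loop can probably be avoided altogether. Expanding the square gives, for column z_i out of k columns, sum_j |z_i - z_j|^2 = k*|z_i|^2 + sum_j |z_j|^2 - 2 * z_i . s, where s is the sum of all columns. The following is a minimal base R sketch of that identity, assuming the input is a numeric matrix (or something `as.matrix` can convert); the name `colpairs_sums` is made up for illustration and is not part of the answer above.

```r
# Hypothetical helper (illustrative name): for each column i, the sum over all
# columns j of sum((Z[, i] - Z[, j])^2), computed without forming any difference.
colpairs_sums <- function(Z) {
  Z <- as.matrix(Z)
  k <- ncol(Z)
  sq <- colSums(Z^2)              # squared norm of every column
  s <- rowSums(Z)                 # sum of all columns, one value per row
  cross <- drop(crossprod(Z, s))  # dot product of every column with s
  k * sq + sum(sq) - 2 * cross
}

# Brute-force comparison on a small matrix: subtracting the matrix from a
# column recycles the column across all columns.
set.seed(1L)
M_small <- matrix(rnorm(12), ncol = 4)
brute <- sapply(seq_len(ncol(M_small)),
                function(i) sum((M_small[, i] - M_small)^2))
all.equal(colpairs_sums(M_small), brute)
```

This needs only a few vectors beyond the input, so it should serve as a cheap cross-check for `colpairs(M)`. If the columns share a large common offset, the subtraction at the end can lose precision; subtracting `rowMeans(Z)` from `Z` first leaves every pairwise difference unchanged and should reduce that risk.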

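If a value per column pair is still needed, as in the original `combn` call, one option is to work through the pairs one leading column at a time and reduce each block of differences before moving on. The sketch below assumes a numeric matrix and a reducer that returns one number per pair; `pairwise_reduce` and its `reduce` argument are hypothetical names chosen for illustration.

```r
# Hypothetical helper (illustrative name): apply `reduce` to the differences
# between column i and every later column, one i at a time, so that only one
# nrow x (k - i) block of differences exists in memory at any moment.
pairwise_reduce <- function(Z, reduce = function(d) colSums(d^2)) {
  Z <- as.matrix(Z)
  k <- ncol(Z)
  out <- vector("list", k - 1L)
  for (i in seq_len(k - 1L)) {
    later <- Z[, (i + 1L):k, drop = FALSE]
    out[[i]] <- reduce(Z[, i] - later)  # block is discarded after reducing
  }
  unlist(out)  # pairs ordered (1,2), (1,3), ..., (2,3), ... like combn
}

# Comparison against combn on a small matrix
set.seed(1L)
M_small <- matrix(rnorm(12), ncol = 4)
ref <- combn(ncol(M_small), 2,
             function(x) sum((M_small[, x[1]] - M_small[, x[2]])^2))
all.equal(pairwise_reduce(M_small), ref)
```

For 13125 columns and 90 rows the largest block is 90 x 13124 doubles (roughly 9.4 MB), while the result holds choose(13125, 2) numbers, so it is the reduced output and not the differences that dominates memory.
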