Memory-efficient way to use sapply() in R

Question

I am trying to reduce the memory consumption of a piece of R code I have been working on. I am using the peakRAM() function to measure the maximum RAM used. The code is long, and it ends with a simple sapply() call. I found that the sapply() part consumes the most memory. So I wrote a small function, fun1(), that imitates the objects and the sapply() call from that part of my code:

library(peakRAM)
fun1 <- function() {
  tm <- matrix(1, nrow = 300, ncol = 10)  #in the original code, the entries are different and nonzero
  print(object.size(tm))
  r <- sapply(1:20000, function(i) {
        colSums(tm[1:200,])  #in the original code, I am subsetting a 200 length vector which varies with i, stored in a list of length 20000
        })
  print(object.size(r))
  r
}

peakRAM(fun1())

If you run this in R, peakRAM() reports a peak of around 330 Mb. But the two objects tm and r are both small (about 24 Kb and 1.6 Mb respectively), and peakRAM() for a single colSums(tm[1:200,]) is tiny, around 0.1 Mb. So it seems that, during sapply(), R is not releasing memory while looping over 1:20000. Otherwise, since a single colSums(tm[1:200,]) needs very little memory and all the objects involved are small, the sapply() call should need little memory as well.

I know that R has a gc() function that frees unneeded memory, and R is probably not clearing memory during sapply(), which results in this high memory consumption. If that is true, is there a way to avoid this and complete the job without so much extra memory? Note that I do not want to sacrifice run time to achieve this.
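A rough back-of-the-envelope check of the numbers in the question, assuming 8-byte doubles and assuming that each iteration's copy of tm[1:200,] stays in memory until the garbage collector runs (an assumption, not something measured here). The variable names are made up for illustration:

```r
# Each tm[1:200, ] is a fresh 200 x 10 double matrix (payload only, header ignored).
bytes_per_subset <- 200 * 10 * 8

# sapply() creates one such temporary per iteration.
n_iter <- 20000

# Total size of all temporaries if none were collected along the way, in Mb (2^20 bytes).
total_mb <- bytes_per_subset * n_iter / 2^20
print(total_mb)

# For comparison: payload of the final 10 x 20000 result, in Mb.
result_mb <- 10 * n_iter * 8 / 2^20
print(result_mb)
```

The total for the temporaries comes out at roughly 305 Mb, which is the same order of magnitude as the reported peak, while the result itself is only about 1.5 Mb. This suggests the per-iteration submatrix copies, not tm or r, are what drive the peak.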

Answer 1

Here is your function, modified to use vapply instead of sapply and .colSums instead of colSums:

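f1 <- function(x, l) {
    n <- ncol(x)
    # .colSums needs the dimensions of the subsetted matrix: length(i) rows, n columns
    FUN <- function(i) .colSums(x[i, , drop = FALSE], length(i), n)
    # double(n) is the template for each result, so vapply returns an n-by-length(l) matrix
    vapply(l, FUN, double(n))
}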
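To see that swapping in vapply and .colSums does not change the result, here is a minimal sketch on a tiny input. The helper name col_sums_by_index and the test data are hypothetical; the helper passes the number of selected rows, length(i), as the row count that .colSums expects.

```r
# Hypothetical helper: column sums of x over each index vector in l.
col_sums_by_index <- function(x, l) {
    n <- ncol(x)
    vapply(l, function(i) .colSums(x[i, , drop = FALSE], length(i), n), double(n))
}

set.seed(1L)
x <- matrix(rnorm(30), nrow = 6, ncol = 5)
l <- list(1:3, c(2L, 2L, 6L), 4L)   # repeated and single indices are allowed

# Reference result using the sapply()/colSums() pattern from the question.
ref <- sapply(l, function(i) colSums(x[i, , drop = FALSE]))

same <- identical(col_sums_by_index(x, l), ref)
print(same)
print(dim(ref))
```

Both versions return one column per index vector. Note that this variant still copies a submatrix on every iteration; it only trims the overhead around that copy.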

Here is a C implementation, made accessible to R via the inline package:

sig <- c(x = "double", l = "list")
bod <- '
/* x: double matrix (m rows, n columns); l: list of 1-based integer row indices */
double *px = REAL(x);
R_xlen_t nx = xlength(x);
int *d = INTEGER(getAttrib(x, R_DimSymbol));
int m = d[0], n = d[1], N = length(l);

/* Result: one column of n sums per list element */
SEXP res = PROTECT(allocMatrix(REALSXP, n, N));
double *pres = REAL(res);

SEXP index;
R_xlen_t nindex;
int *pindex;

double sum;

for (int i = 0, rpos = 0; i < N; ++i)
{
    index = VECTOR_ELT(l, i);
    nindex = xlength(index);
    pindex = INTEGER(index);
    /* xpos is the offset of the first element of each column of x */
    for (int xpos = 0; xpos < nx; xpos += m, ++rpos)
    {
        sum = 0.0;
        /* Sum the selected rows in place, without copying a submatrix */
        for (int k = 0; k < nindex; ++k)
        {
            sum += px[xpos + pindex[k] - 1];
        }
        pres[rpos] = sum;
    }
}
UNPROTECT(1);
return res;
'
f2 <- inline::cfunction(sig, bod, language = "C")

And here is a test showing that f1 and f2 give identical results, using a 300-by-10 double matrix and a length-20000 list of length-200 index vectors:

set.seed(1L)
m <- 300L
n <- 10L
x <- matrix(rnorm(m * n), m, n)
l <- replicate(2e+04, sample(m, size = 200L, replace = TRUE), simplify = FALSE)
identical(f1(x, l), f2(x, l))
## [1] TRUE

If you profile f1(x, l) and f2(x, l) using Rprof and summaryRprof (or maybe your peakRAM, though I've never used it), you will find that f2 is both faster and more memory-efficient.
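One possible way to run such a profile with Rprof and summaryRprof is sketched below. The wrapper profile_call is a hypothetical helper, not part of any package, and the workload is the sapply()/colSums() pattern from the question rather than f1 or f2, so that the block runs without a compiler.

```r
# Hypothetical helper: profile a single call and return the per-function summary.
profile_call <- function(expr, interval = 0.01) {
    out <- tempfile(fileext = ".out")
    on.exit(unlink(out), add = TRUE)
    # memory.profiling = TRUE records memory use at each sampling tick
    Rprof(out, interval = interval, memory.profiling = TRUE)
    force(expr)        # the call is evaluated here, while the profiler is running
    Rprof(NULL)        # stop profiling
    summaryRprof(out, memory = "both")$by.total
}

set.seed(1L)
x <- matrix(rnorm(3000), 300, 10)
l <- replicate(2e4, sample(300L, size = 200L, replace = TRUE), simplify = FALSE)

prof <- profile_call(sapply(l, function(i) colSums(x[i, , drop = FALSE])))
print(head(prof))
```

Rprof is a sampling profiler, so a call that finishes within a few sampling intervals may record no samples, and summaryRprof can then fail; repeating the call inside expr is one way around that. Swapping in f1(x, l) or f2(x, l) for the sapply() call gives the comparison described above.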

Note that I've only used the R API in my C code. You may find Rcpp more approachable, in which case you are encouraged to implement a C++ equivalent of my f2 based on Rcpp.

Answer 1 metadata: 1 2022-02-03 03:10:39
