

Does R bigmemory always use a backing file?

We are trying to use the bigmemory library with foreach to parallelize our analysis. However, the as.big.matrix function seems to always use a backing file. Our workstations have enough memory; is there a way to use bigmemory without the backing file?

This code, x.big.desc <- describe(as.big.matrix(x)), is pretty slow, as it writes the data to C:\ProgramData\boost_interprocess\. Somehow it is slower than saving x directly; does as.big.matrix have slower I/O?

This code, x.big.desc <- describe(as.big.matrix(x, backingfile = "")), is pretty fast; however, it also saves a copy of the data to the %TMP% directory. We think it is fast because R kicks off a background writing process instead of actually writing the data. (We can see the writing thread in Task Manager after the R prompt returns.)

Is there a way to use bigmemory with RAM only, so that each worker in the foreach loop can access the data via RAM?
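For illustration, a minimal sketch of the RAM-only sharing pattern we have in mind (toy data; the worker body is a placeholder, not our actual analysis):

library(bigmemory)
library(foreach)

x <- matrix(rnorm(1e6), 1e3, 1e3)
x.bm <- as.big.matrix(x)    # no backingfile argument
x.desc <- describe(x.bm)    # lightweight descriptor to send to workers

cl <- parallel::makeCluster(3)
doParallel::registerDoParallel(cl)
res <- foreach(ic = 1:3, .combine = 'c', .packages = "bigmemory") %dopar% {
  x.part <- attach.big.matrix(x.desc)  # re-attach to the shared segment, no copy
  sum(x.part[, ic])                    # placeholder computation on shared data
}
parallel::stopCluster(cl)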

Thanks for the help.

So, if you have enough RAM, just use standard R matrices. To pass only a part of each matrix to each cluster worker, use .rds files.

An example computing the colSums with 3 cores:

# Functions for splitting
CutBySize <- function(m, nb) {
  int <- m / nb

  upper <- round(1:nb * int)
  lower <- c(1, upper[-nb] + 1)
  size <- c(upper[1], diff(upper))

  cbind(lower, upper, size)
}
seq2 <- function(lims) seq(lims[1], lims[2])

# The matrix
bm <- matrix(1, 10e3, 1e3)
ncores <- 3
intervals <- CutBySize(ncol(bm), ncores)
# Save each part in a different file
tmpfile <- tempfile()
for (ic in seq_len(ncores)) {
  saveRDS(bm[, seq2(intervals[ic, ])], 
          paste0(tmpfile, ic, ".rds"))
}
# Parallel computation with reading one part at the beginning
cl <- parallel::makeCluster(ncores)
doParallel::registerDoParallel(cl)
library(foreach)
colsums <- foreach(ic = seq_len(ncores), .combine = 'c') %dopar% {
  bm.part <- readRDS(paste0(tmpfile, ic, ".rds"))
  colSums(bm.part)
}
parallel::stopCluster(cl)
# Checking results
all.equal(colsums, colSums(bm))

You could even use rm(bm); gc() after writing the parts to disk, to free the original matrix from RAM.
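Not required, but you can also remove the temporary part files once the computation is done, using the tmpfile prefix defined above:

# Clean up the temporary .rds parts
file.remove(paste0(tmpfile, seq_len(ncores), ".rds"))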
