R script exhausting memory - Microsoft HPC Cluster

I have an R script with the following source code:

genofile <- read.table("D_G.txt", header = TRUE, sep = ",")  # load the entire file into memory
genofile <- genofile[genofile$GC_SCORE > 0.15, ]  # keep only rows with GC_SCORE above 0.15
cat(unique(as.vector(genofile[, 2])), file = "GF_uniqueIDs.txt", sep = "\n")  # write the unique IDs from column 2

D_G.txt is a huge file, about 5 GB.

Now, the computation is performed on a Microsoft HPC cluster, so, as you know, when I submit the job it gets split across different physical nodes; in my case each one has 4 GB of RAM.

Well, after a variable amount of time, I get the infamous cannot allocate vector of size xxx Mb error message. I've tried to use a switch which limits the usable memory:

--max-memory=1GB

but nothing changed.

I've tried Rscript 2.15.0, both 32- and 64-bit, with no luck.

The fact that your dataset as such should fit in the memory of one node does not mean that it still fits while an analysis is being performed on it. Analyses often cause data to be copied. In addition, some inefficient programming on your side could also increase memory usage. Setting the switch and limiting the memory use of R only makes things worse: it does not reduce the memory the script actually needs, it merely caps the maximum R is allowed to allocate. And using a 32-bit OS is always a bad idea memory-wise, as the maximum memory that can be addressed by a single process on a 32-bit OS is less than 4 GB.
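To see what "copying of data" means in practice, R's copy-on-modify behaviour can be made visible with tracemem(); a minimal sketch, with x as a throwaway example vector:

x <- runif(1e6)   # an ~8 MB numeric vector
tracemem(x)       # print a message whenever x gets duplicated
y <- x            # no copy yet: y and x share the same memory
y[1] <- 0         # modifying y triggers a full duplicate, doubling usage

The same effect bites your script: genofile[genofile$GC_SCORE > 0.15, ] builds a second data frame while the original one, holding roughly 5 GB of data, is still alive, so peak memory can approach twice the data size before the old copy is garbage collected.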

Without more details it is hard to help you any further with this problem. In general I would recommend cutting the dataset up into smaller and smaller pieces until you succeed. I assume that your problem is embarrassingly parallel, and that cutting the dataset up further does not change the output; a chunked sketch follows below.
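One way to cut the work into pieces without changing the output is to stream the file through a connection and filter one chunk at a time, so only the running set of unique IDs grows. This is a sketch, not a drop-in replacement: the 100000-row chunk size is arbitrary, and it assumes the header line contains plain comma-separated column names including GC_SCORE, with the IDs in column 2 as in the original script.

con <- file("D_G.txt", open = "r")
cols <- make.names(strsplit(readLines(con, n = 1), ",")[[1]])  # parse the header line once
ids <- character(0)
repeat {
  # read.table signals an error when the connection is exhausted,
  # so tryCatch turns end-of-file into NULL
  chunk <- tryCatch(
    read.table(con, header = FALSE, sep = ",", nrows = 100000,
               col.names = cols, stringsAsFactors = FALSE),
    error = function(e) NULL)
  if (is.null(chunk)) break
  keep <- chunk[chunk$GC_SCORE > 0.15, ]       # same filter as the original script
  ids <- unique(c(ids, as.vector(keep[, 2])))  # accumulate unique IDs from column 2
}
close(con)
cat(ids, file = "GF_uniqueIDs.txt", sep = "\n")

Passing colClasses to read.table would cut its parsing overhead further, and if a single chunk still does not fit, lowering nrows shrinks the peak footprint at the cost of more iterations.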
