
Does multicore computing using R's doParallel package use more memory?

I just tested an elastic net with and without a parallel backend. The call is:

enetGrid <- data.frame(.lambda=0,.fraction=c(.005))
ctrl <- trainControl( method="repeatedcv", repeats=5 )
enetTune <- train( x, y, method="enet", tuneGrid=enetGrid, trControl=ctrl, preProc=NULL )

I ran it without a parallel backend registered (and got the warning message from %dopar% when the train call finished), and then again with a backend registered for 7 cores (of 8). The first run took 529 seconds, the second 313. But the first used at most 3.3GB of memory (as reported by the Sun cluster system), while the second used 22.9GB. I've got 30GB of RAM, and the task only gets more complicated from here.

Questions: 1) Is this a general property of parallel computation? I thought they shared memory... 2) Is there a way around this while still using enet inside train? If doParallel is the problem, are there other architectures that I could use with %dopar% -- no, right?

Because I am interested in whether this is the expected result, this is closely related to, but not exactly the same as, the question below. I'd be fine closing this and merging my question into that one (or marking that one as a duplicate and pointing to this one, since this has more detail) if that's what the consensus is:

Extremely high memory consumption of new doParallel package

In multithreaded programs, threads share lots of memory; it's primarily the stack that isn't shared between threads. But, to quote Dirk Eddelbuettel, "R is, and will remain, single-threaded", so R parallel packages use processes rather than threads, and so there is much less opportunity to share memory.

However, memory is shared between the processes that are forked by mclapply, as long as the processes don't modify it (modifying a page triggers a copy of that memory region by the operating system). That is one reason that the memory footprint can be smaller when using the "multicore" API rather than the "snow" API with parallel/doParallel.
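As a minimal sketch of that fork-based sharing (Unix-like systems only, since mclapply relies on fork() and runs sequentially on Windows; the variable names are illustrative):

```r
library(parallel)

big <- rnorm(1e7)  # a large vector allocated once in the parent process

# The forked children read `big` through copy-on-write pages, so the
# vector is not duplicated per worker as long as no child writes to it.
res <- mclapply(1:4, function(i) mean(big) + i, mc.cores = 4)
```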

In other words, using:

registerDoParallel(7)

may be much more memory efficient than using:

cl <- makeCluster(7)
registerDoParallel(cl)

since the former will cause %dopar% to use mclapply on Linux and Mac OS X, while the latter uses clusterApplyLB.
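As a sketch of the difference (a Unix-like machine is assumed, and `big` is an illustrative stand-in for your training data), the same %dopar% loop can run against either backend; only the registration changes:

```r
library(doParallel)  # also attaches foreach and parallel

big <- rnorm(1e7)  # data the workers only read

# Fork backend: workers inherit `big` copy-on-write, so it is not
# duplicated while they only read it.
registerDoParallel(7)
res_fork <- foreach(i = 1:7, .combine = c) %dopar% mean(big)

# Cluster ("snow") backend: `big` is serialized and copied to each of
# the 7 worker processes, so memory use grows with the worker count.
cl <- makeCluster(7)
registerDoParallel(cl)
res_snow <- foreach(i = 1:7, .combine = c) %dopar% mean(big)
stopCluster(cl)
```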

However, the "snow" API allows you to use multiple machines, and that means that your total memory increases with the number of CPUs. This is a great advantage since it allows programs to scale. Some programs even get super-linear speedup when running in parallel on a cluster, since they have access to more memory.

So to answer your second question, I'd say to use the "multicore" API with doParallel if you only have a single machine and are using Linux or Mac OS X, but use the "snow" API if you're running on a cluster of multiple machines. I don't think there is any way to use shared-memory packages such as Rdsm with the caret package.

There is a minimum number of characters; otherwise I would simply have typed: 1) Yes. 2) No, er, maybe. There are packages that use a "shared memory" model for parallel computation, but R's more thoroughly tested packages don't use it.

http://www.stat.berkeley.edu/scf/paciorek-parallelWorkshop.pdf

http://heather.cs.ucdavis.edu/~matloff/158/PLN/ParProcBook.pdf

http://heather.cs.ucdavis.edu/Rdsm/BARUGSlides.pdf
