R: How to split a data frame in foreach %dopar%

Question

This is a very simple example.

df = c("already ","miss you","haters","she's cool")
df = data.frame(df)

library(doParallel)
cl = makeCluster(4)
registerDoParallel(cl)    
foreach(i = df[1:4,1], .combine = rbind, .packages='tm')  %dopar% classification(i)
stopCluster(cl)

In the real case I have a data frame with n = 400000 rows. I don't know how to send nrow/ncluster rows to each cluster node in one step. What should i = be?

I tried isplitRows from library(itertools), without success.

Answer 1

You should try to work with indices to create subsets of your data.

# iterate over every row index, not just the single value nrow(df)
foreach(i = seq_len(nrow(df)), .combine = rbind, .packages='tm')  %dopar% {
  tmp <- df[i, ]
  classification(tmp)
}

This takes a new row of the data.frame on each iteration.

Furthermore, note that a foreach loop returns its result rather than modifying anything in place, so you should assign it to a variable like this:
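With 400,000 rows, dispatching one row per task is likely to spend most of its time on scheduling. The same index idea can be applied to blocks of rows instead. The following is a minimal sketch, assuming `parallel::splitIndices()` is used to build one contiguous block of indices per worker; `classify_chunk` is a hypothetical stand-in for `classification()`, and the worker count of 2 and the column name `text` are arbitrary choices for illustration:

```r
library(foreach)
library(doParallel)

# Hypothetical stand-in for classification(): one result row per input row.
classify_chunk <- function(chunk) {
  data.frame(text = chunk$text, n_chars = nchar(chunk$text))
}

df <- data.frame(text = c("already ", "miss you", "haters", "she's cool", "ok"),
                 stringsAsFactors = FALSE)

n_workers <- 2
cl <- makeCluster(n_workers)
registerDoParallel(cl)

# One contiguous block of row indices per worker, of roughly equal size.
blocks <- parallel::splitIndices(nrow(df), n_workers)

# Each task receives a vector of row indices and processes that block.
res <- foreach(idx = blocks, .combine = rbind) %dopar% {
  classify_chunk(df[idx, , drop = FALSE])
}

stopCluster(cl)
```

Because the loop body refers to `df`, the whole data frame is exported to every worker. If that is too much memory, iterate over the pieces themselves (for example `split(df, ...)`) so each worker only receives its own rows.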

res <- foreach(i = 1:10, .combine = c, ....) %dopar% {
  # things you want to do
  x <- someFancyFunction()

  # the last value will be returned and combined by the .combine function
  x 
}

Answer 2

Try using a combination of split and mclapply as proposed in Approach 1 here: https://www.r-bloggers.com/trying-to-reduce-the-memory-overhead-when-using-mclapply/

split lets you split data into groups defined by a factor, or you can just use 1:nrow(df) if you want to do the operation on each row separately.
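As an illustration of that combination, here is a small sketch using only the base `parallel` package. It is not taken from the linked post; `classify_chunk`, the column name `text`, and the two-group split are assumptions made for the example:

```r
library(parallel)

# Hypothetical stand-in for the real per-group work.
classify_chunk <- function(chunk) {
  data.frame(text = chunk$text, n_chars = nchar(chunk$text))
}

df <- data.frame(text = c("already ", "miss you", "haters", "she's cool"),
                 stringsAsFactors = FALSE)

# Two contiguous groups of rows; any factor of length nrow(df) works here.
groups <- rep(1:2, each = 2)

# mclapply forks the R process, which is not available on Windows,
# so fall back to a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Apply the function to each group in parallel, then stack the results.
pieces <- mclapply(split(df, groups), classify_chunk, mc.cores = cores)
res <- do.call(rbind, pieces)
```

`mclapply` returns the pieces in the same order as the list produced by `split`, so `res` keeps the original row order as long as the groups are contiguous.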

Answer 3

My solution after your comments:

n = 8  # number of cluster workers
library(foreach)
library(doParallel)
cl = makeCluster(n)
registerDoParallel(cl)

z = nrow(df)
y = floor(z/n) 
x = nrow(df)%%n

ris = foreach(i = split(df[1:(z-x),],rep(1:n,each=y)), .combine = rbind, .packages='tm')  %dopar% someFancyFunction(i)

stopCluster(cl)

# sequential: process the x leftover rows that did not fit into the n equal chunks
if (x !=0 )
    ris = rbind(ris,someFancyFunction(df[(z-x+1):z,1]))

Note: I run the remainder sequentially at the end because, if x is not zero, split recycles the grouping vector and puts the leftover rows (z - (z - x) = x of them) in the first cluster, which changes the order of the result.
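The sequential step can be avoided if the grouping vector covers all z rows. A possible sketch, not part of the original answer: label each row with a chunk number computed from its position, so the leftover rows are spread across the chunks and the order is preserved. The toy data frame and the value n = 4 are assumptions for the example:

```r
df <- data.frame(text = paste("row", 1:10), stringsAsFactors = FALSE)
n <- 4            # number of chunks (one per worker)
z <- nrow(df)

# Chunk label for each row: contiguous, non-decreasing, and of length z,
# so no rows are left over and split() does not need to recycle anything.
grp <- ceiling(seq_len(z) * n / z)

chunks <- split(df, grp)
sizes <- vapply(chunks, nrow, integer(1))
```

The resulting `chunks` list can be passed directly as `i = chunks` in the foreach call above, with `.combine = rbind` returning the rows in their original order.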

Answer 1
3 2016-09-29 08:40:15

Answer 2
0 2016-09-29 08:46:42

Answer 3
0 2016-09-29 19:27:37
