
Multiple dataframes in a parallel function in R

In R I'm calling parLapply() on a list and filtering 2 dataframes within the function using the elements from the list, e.g.:

myfunction <- function(id) {
  r1 <- r %>% filter(ID == id)
  b1 <- b %>% filter(ID == id)
  doSomething(r1,b1)
}
result <- parLapply(cluster, listOfIDs, myfunction)

The SLURM system I'm using runs out of memory because, I think, the two large dataframes (r and b) are loaded every time myfunction() is called from parLapply(). Memory isn't exceeded with smaller datasets.

Therefore I want to load only a chunk of the dataframes r and b each time the function is called, to lower the memory requirements. Something like this (tested in series):

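library(dplyr)       # provides %>% and filter()
library(doParallel)
library(foreach)
# Iterate over r one row at a time; each r1 is a one-row dataframe
foreach(r1 = split(r, seq_len(nrow(r)))) %do% {
  # Keep only the rows of b that share r1's rowname (both columns are character)
  b1 <- b %>% filter(rowname == r1$rowname)
  print(b1) # doSomething(r1, b1)
}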

But I would also like to filter b outside of the function so the whole dataframe isn't loaded in every instance. b1 and r1 must have the same rowname. Is this possible?

Data

> dput(r)
structure(list(ID_DRAIN = c(115504, 115865, 115892, 115955, 115983, 
115940, 116033, 116028, 115873, 115905, 115835, 115885, 115452, 
115472, 115749, 115900, 115944, 115817, 115860, 115234, 115753, 
115505, 115899, 115939, 116015, 115191, 115214, 115339, 115799, 
115809, 115898, 115864), rowname = c("1", "7", "8", "9", "10", 
"11", "12", "14", "18", "19", "22", "23", "25", "26", "27", "29", 
"30", "37", "38", "39", "42", "44", "45", "46", "49", "50", "51", 
"57", "59", "60", "61", "63")), row.names = c(1L, 7L, 8L, 9L, 
10L, 11L, 12L, 14L, 18L, 19L, 22L, 23L, 25L, 26L, 27L, 29L, 30L, 
37L, 38L, 39L, 42L, 44L, 45L, 46L, 49L, 50L, 51L, 57L, 59L, 60L, 
61L, 63L), class = "data.frame")

> dput(b)
structure(list(LabelAtlas = structure(c(2L, 2L, 2L, 2L, 4L, 7L, 
7L, 7L, 7L, 7L, 7L, 7L, 7L), .Label = c("Culvert", "dam", "Ford", 
"Ramp/bed_sill", "sluice", "unknown", "weir"), class = "factor"), 
    rowname = c("57", "11", "7", "19", "11", "25", "38", "37", 
    "57", "57", "25", "25", "7")), row.names = c(325L, 413L, 
414L, 1607L, 2382L, 2837L, 2870L, 2945L, 3272L, 3402L, 3433L, 
3562L, 4753L), class = "data.frame")

Turns out you can give foreach more than one argument...

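# groupID is the grouping column used here; it does not appear in the sample data above
bgrouped <- b %>% group_by(groupID)

# foreach steps through both lists in lockstep, so the i-th chunk of b
# must correspond to the i-th row of r
foreach(b1 = group_split(bgrouped),
        r1 = split(r, seq_len(nrow(r))),
        .combine = data.frame) %dopar% {
  doSomething(r1, b1)
}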
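
Because foreach pairs its arguments by position, one way to make sure b1 and r1 always share the same rowname is to split both dataframes by rowname and index both lists by the same keys. The following is only a minimal sketch under assumptions: the small r and b are toy stand-ins for the real data, doSomething() is a hypothetical placeholder, and %do% is used so it runs without a registered backend.

library(foreach)

# Toy stand-ins for the real dataframes
# (assumption: both have a character `rowname` column)
r <- data.frame(ID_DRAIN = c(115865, 115940, 115452),
                rowname = c("7", "11", "25"),
                stringsAsFactors = FALSE)
b <- data.frame(LabelAtlas = c("dam", "dam", "weir", "weir"),
                rowname = c("11", "7", "25", "25"),
                stringsAsFactors = FALSE)

# Hypothetical placeholder for the real per-chunk work
doSomething <- function(r1, b1) {
  data.frame(rowname = r1$rowname[1], n_r = nrow(r1), n_b = nrow(b1))
}

# Split both dataframes by rowname, then keep only the keys present in both
r_chunks <- split(r, r$rowname)
b_chunks <- split(b, b$rowname)
keys <- intersect(names(r_chunks), names(b_chunks))

# Indexing both lists by the same keys keeps each (r1, b1) pair aligned
result <- foreach(r1 = r_chunks[keys], b1 = b_chunks[keys],
                  .combine = rbind) %do% {
  doSomething(r1, b1)
}
print(result)

Swapping %do% for %dopar% after registering a backend should, as far as I can tell, send each task only its own pair of chunks, provided the loop body does not refer to the full r or b directly.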
