How can I summarise and bind large datasets in parallel in R using the tidyverse?

Question

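I have several large daily datasets that I need to summarise and bind together by month in R. Because the datasets are so large, I want to run the summarising step in parallel to speed it up. I have managed to summarise and bind them with a regular loop, but the summarising part takes all night.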

The datasets look like this:

number  id  date
1       1   0102
1       1   0102
2       1   0102
2       2   0102

And I want:

number  id  date  count
1       1   0102  2
2       1   0102  1
2       2   0102  1

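library(tidyverse)
library(doParallel)  # also attaches foreach and parallel

# Count rows per (number, date, id), then keep the most frequent id
# for each (number, date) together with the total number of calls.
collapse_cdr <- function(data) {
  data %>%
    group_by(number, date, id) %>%
    summarise(count = n()) %>%  # result stays grouped by number, date
    mutate(total.calls = sum(count)) %>%
    slice(which.max(count))
}

wd <- "working directory"

cl <- makeCluster(8)
registerDoParallel(cl)
# day_code holds the day codes that name the daily CSV files
month <- foreach(i = day_code, .combine = rbind,
                 .packages = c("tidyverse", "readr")) %dopar% {
  filename <- paste0(wd, "/", i, ".csv")
  dta <- read_csv(filename, col_types = cols(.default = "c"))
  dta$date <- i
  dta <- collapse_cdr(data = dta)
  data.frame(dta)
}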

Now I get the warning: closing unused connection 62 (<-localhost:11439)

Thanks!

Answer 1

I suggest using a data.table approach:

library(data.table)
library(doParallel)
library(foreach)

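# Collapse the data: count rows per (number, date, id), add the total
# per (number, date), and keep the row with the highest count.
collapse_cdr <- function(d) {
  d[, .(count = .N), .(number, date, id)][
    , total.calls := sum(count), .(number, date)][
      , .SD[which.max(count)], .(number, date)]
}

wd <- "working directory"

cl <- makeCluster(8)
registerDoParallel(cl)
month <- rbindlist(
  # .packages loads data.table on each worker so fread() and := are available
  foreach(i = day_code, .packages = "data.table") %dopar% {
    collapse_cdr(fread(paste0(wd, "/", i, ".csv"))[, date := i])
  }
)
stopCluster(cl)  # release the worker connections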
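
If you would rather stay with the tidyverse, the loop from the question can probably be kept almost as it is. The "closing unused connection" warning is typically what R prints when worker connections are garbage-collected without having been closed, so calling stopCluster(cl) once the loop has finished, as the data.table code above does, is the likely fix. Below is a minimal sketch of that idea. The temporary directory, the two toy day files, and the worker count of 2 are assumptions made only so the example runs on its own; they are not part of the original data.

```r
library(dplyr)
library(readr)
library(doParallel)  # also attaches foreach and parallel

# Same aggregation as in the question: count rows per (number, date, id),
# then keep the most frequent id per (number, date).
collapse_cdr <- function(data) {
  data %>%
    count(number, date, id, name = "count") %>%
    group_by(number, date) %>%
    mutate(total.calls = sum(count)) %>%
    slice(which.max(count)) %>%
    ungroup()
}

# Hypothetical toy input: two small daily files in a temporary directory.
wd <- tempfile("cdr_")
dir.create(wd)
day_code <- c("0102", "0103")
for (d in day_code) {
  write_csv(
    tibble(number = c(1, 1, 2, 2), id = c(1, 1, 1, 2)),
    file.path(wd, paste0(d, ".csv"))
  )
}

cl <- makeCluster(2)  # small worker count, chosen only for the demo
registerDoParallel(cl)

month <- foreach(i = day_code, .combine = bind_rows,
                 .packages = c("dplyr", "readr")) %dopar% {
  dta <- read_csv(file.path(wd, paste0(i, ".csv")),
                  col_types = cols(.default = "c"))
  dta$date <- i
  collapse_cdr(dta)
}

stopCluster(cl)  # close the worker connections explicitly

print(month)
```

Loading only dplyr and readr on the workers, rather than the whole tidyverse, keeps worker start-up a little lighter. If the loop lives inside a function, registering the shutdown with on.exit(stopCluster(cl)) right after makeCluster() should also close the workers when an error interrupts the loop.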
