How to process and combine data.frames in a list faster

Question

I have run into very slow data processing when appending the rows of multiple data.frames. I use a combination of lapply and dplyr for the processing. However, the process becomes very slow because each data frame has 20000 rows and there are 100 files in the directory.

This is currently a huge bottleneck for me: even after the lapply step finishes, I don't have enough memory for the bind_rows step.

Here is my data processing method.

First, make a list of files:

files <- list.files("file_directory", pattern = "w.*.csv", recursive = TRUE, full.names = TRUE)

Then process this list of files:

library(tidyr)
library(dplyr)

data <- lapply(files, function(x) {
  read.table(file = x, sep = ',', header = TRUE, fill = FALSE, skip = 0,
             stringsAsFactors = FALSE, row.names = NULL) %>%
    select(A, B, C) %>%
    # combine B and C into BC; remove = FALSE keeps C for the steps below
    unite(BC, B, C, sep = '_', remove = FALSE) %>%
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    select(BC, KK, D)
})

data <- bind_rows(data)

I am getting an error which says:

"Error: cannot allocate vector of size ... Mb"

The size depends on how much RAM I have left. I have 8 GB of RAM, but it still seems to struggle ;(

I also tried do.call, but nothing changed! Which function or approach would help with this issue? I use R version 3.4.2 and dplyr 0.7.4.

Answer 1

I can't test this answer since there's no reproducible data, but I guess it could be something like the following, using data.table:

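library(data.table)

# read only the needed columns and name each list element after its file
data <- setNames(lapply(files, function(x) {
  fread(x, select = c("A", "B", "C"))
}), basename(files))

# stack everything once; the list names end up in the file_id column
data <- rbindlist(data, use.names = TRUE, fill = TRUE, idcol = "file_id")
# add the derived columns by reference
data[, BC := paste(B, C, sep = "_")]
data[, D := C * A]
data[, KK := median(C, na.rm = TRUE), by = .(BC, file_id)]
# drop every column except BC, KK and D
data[, setdiff(names(data), c("BC", "KK", "D")) := NULL]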
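
Since the answer could not be tested against real data, here is a small sketch that runs the same steps on two toy CSV files written to a temporary directory. The directory name `rbindlist_demo`, the file names, the extra column `Z`, and all values are made up for illustration; only `fread`, `fwrite`, `rbindlist`, and `:=` come from data.table.

```r
library(data.table)

# Hypothetical toy input: two small CSV files in a temporary directory.
tmp_dir <- file.path(tempdir(), "rbindlist_demo")
dir.create(tmp_dir, showWarnings = FALSE)
fwrite(data.table(A = 1:4, B = c("x", "x", "y", "y"),
                  C = c(0.5, 0.5, 1.5, 1.5), Z = 0),
       file.path(tmp_dir, "w1.csv"))
fwrite(data.table(A = 5:6, B = c("x", "x"), C = c(0.5, 0.5), Z = 0),
       file.path(tmp_dir, "w2.csv"))

files <- list.files(tmp_dir, pattern = "^w.*\\.csv$", full.names = TRUE)

# Read only columns A, B, C (Z is skipped); name each element after its file.
data <- setNames(lapply(files, fread, select = c("A", "B", "C")),
                 basename(files))

# Stack once, keeping the file name in `file_id`.
data <- rbindlist(data, use.names = TRUE, fill = TRUE, idcol = "file_id")

# Add the derived columns by reference.
data[, BC := paste(B, C, sep = "_")]
data[, D := C * A]
data[, KK := median(C, na.rm = TRUE), by = .(BC, file_id)]

# Keep only BC, D and KK.
data[, setdiff(names(data), c("BC", "KK", "D")) := NULL]

print(data)
```

The result has one row per input row (six here), so the memory needed still grows with the total number of rows; the saving comes from reading only three columns and modifying the combined table in place.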

Answer 2

Using ldply from the plyr package would eliminate the need to bind the list after processing, as it outputs a data.frame:

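library(tidyr)
library(dplyr)
# plyr is called as plyr::ldply instead of being attached: loading plyr after
# dplyr masks dplyr::mutate, and plyr::mutate ignores group_by()

files <- list.files("file_directory", pattern = "w.*.csv", recursive = TRUE, full.names = TRUE)

data <- plyr::ldply(files, function(x) {
  read.table(file = x, sep = ',', header = TRUE, fill = FALSE, skip = 0,
             stringsAsFactors = FALSE, row.names = NULL) %>%
    select(A, B, C) %>%
    # combine B and C into BC; remove = FALSE keeps C for the steps below
    unite(BC, B, C, sep = '_', remove = FALSE) %>%
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    ungroup() %>%
    select(BC, KK, D)
})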
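
To see what ldply returns, the following sketch applies the same per-file pipeline to two toy CSV files in a temporary directory. The directory name `ldply_demo`, the file names, the values, and the helper `process_file` are hypothetical and not part of the original answer; plyr is called with `plyr::` so that it does not mask dplyr's verbs.

```r
library(tidyr)
library(dplyr)

# Hypothetical toy input: two small CSV files in a temporary directory.
tmp_dir <- file.path(tempdir(), "ldply_demo")
dir.create(tmp_dir, showWarnings = FALSE)
write.csv(data.frame(A = 1:4, B = c("x", "x", "y", "y"),
                     C = c(0.5, 0.5, 1.5, 1.5)),
          file.path(tmp_dir, "w1.csv"), row.names = FALSE)
write.csv(data.frame(A = 5:6, B = c("x", "x"), C = c(0.5, 0.5)),
          file.path(tmp_dir, "w2.csv"), row.names = FALSE)

files <- list.files(tmp_dir, pattern = "^w.*\\.csv$", full.names = TRUE)

# Hypothetical helper: read one file and compute BC, KK and D for it.
process_file <- function(x) {
  read.csv(x, stringsAsFactors = FALSE) %>%
    select(A, B, C) %>%
    unite(BC, B, C, sep = "_", remove = FALSE) %>%
    mutate(D = C * A) %>%
    group_by(BC) %>%
    mutate(KK = median(C, na.rm = TRUE)) %>%
    ungroup() %>%
    select(BC, KK, D)
}

# ldply applies process_file to each path and row-binds the results,
# so no separate bind_rows() call is needed afterwards.
result <- plyr::ldply(files, process_file)

print(result)
```

Note that ldply still holds every per-file result in memory before combining them, so this mainly tidies the code rather than reducing peak memory use.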

2 answers

Answer 1: score 4, accepted, 2017-10-12 14:50:00

Answer 2: score 2, 2017-10-12 14:54:10
