
Merging and appending a list of ffdf dataframes

I would like to read a vector of CSV file names as ffdf data frames and combine them into one big ffdf data frame. I have found solutions that use other R packages; however, my combined data can reach 40 GB, which definitely needs to be stored on disk, as the ff package does, rather than in RAM. As far as I know, the existing solutions all build the result in RAM.

library(ffbase)
library(ff)
library(data.table)  # fread()
library(stringr)     # str_split()

# Create list of csv files
csv_files <- list.files(path = input_path,
                        pattern = "\\.csv$",  # list.files() takes a regex, not a glob
                        full.names = TRUE)

# my approach so far
# this use fread, and it appears to be consuming RAM 

# Read the files in, assuming comma separator
csv_files_df <- lapply(csv_files, function(x) {
  y <- unlist(str_split(x, "[.]"))[1]
  assign(y, as.ffdf(fread(x, stringsAsFactors = TRUE)))
})

# Combine them
combined_df <- do.call("ffdfappend", lapply(csv_files_df, as.ffdf))

When I try to combine them, it throws this error:

> combined_df <- do.call("ffdfappend", lapply(csv_files_df, as.ffdf))
Error in ffdfappend(list(virtual = list(VirtualVmode = c("double", "integer",  : 
  'list' object cannot be coerced to type 'logical'

Summary: I would like to read and merge the CSV files using only the ff package, without needing another package, so as to avoid an OOM (out-of-memory) condition.

The ffdfappend() function only takes two data arguments, x and y. When you provide a list, it assumes some of the data frames are the other arguments to ffdfappend(). To use this function the way you intend, you probably need to write it as a loop, something like this:

csv_files <- list.files(path = input_path,
                        pattern = "\\.csv$",  # regex, not a glob
                        full.names = TRUE)

# helper that reads one csv: fread() loads it into RAM, then
# as.ffdf() moves it into an on-disk ff structure
read <- function(x) {
  as.ffdf(fread(x, stringsAsFactors = TRUE))
}

# Read the files in, assuming comma separator
out <- read(csv_files[1])

# append the remaining files one at a time; ffdfappend() grows the
# on-disk ffdf rather than an in-memory object
for (i in seq_along(csv_files)[-1]) {   # [-1] also handles the single-file case
  out <- ffdfappend(out, read(csv_files[[i]]))
}
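The loop above can also be written as a fold. This is only a sketch under the same assumptions (an `input_path` variable pointing at the csv directory): `Reduce()` pairs naturally with a two-argument append like `ffdfappend()`, so only one csv sits in RAM at a time while the accumulated result lives on disk.

```r
library(ff)
library(ffbase)
library(data.table)

# assumes input_path points at the directory of csv files
csv_files <- list.files(path = input_path,
                        pattern = "\\.csv$",
                        full.names = TRUE)

# read a single csv into RAM with fread(), then push it to disk
read_one <- function(f) as.ffdf(fread(f, stringsAsFactors = TRUE))

# fold ffdfappend() pairwise over the files: the accumulator is an
# on-disk ffdf, so peak memory stays bounded by one file at a time
combined_df <- Reduce(function(acc, f) ffdfappend(acc, read_one(f)),
                      csv_files[-1],
                      init = read_one(csv_files[[1]]))
```

If `csv_files` contains a single file, `csv_files[-1]` is empty and `Reduce()` simply returns the `init` value, so the edge case is covered without extra code.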

