
Combining multiple csv files together in an R loop

I have a folder with many csv files in it. They all have the same structure, shown in this picture:

[image: structure of the csv files]

I want to count the values under my variable "Type" and get an output that tells me that there are two 7s, two 9s, two 1s and so on. I want to do this for all the csv files in my folder, and it would be great to bind the outputs from the different files together (with an identifier for the file each output was extracted from). So far, I have managed to do it for individual files with this code:

mydata <- read.csv("1_data.csv", skip=1, header = T)
df <- data.frame(table(mydata$Type))

However, I tried to code a loop and got stuck. This is the code I am using:

files = list.files(pattern = "*.csv")

for (i in files) {
  id <- substr(i, 1, 5)
  mydata <- read.csv (i, skip=1, header = T)
  datatobind <- data.frame(table(mydata$Type))
  datatobind["id"] <- as.numeric(id)
  data <- rbind(data, datatobind)
}
do.call (rbind, data)

write.csv(data, file='final.csv', row.names=FALSE)

I get a different error every time I try to change the code, so I am not sure how to fix this.

Here are a couple of ways to count the Type column in each file, add a new column with the filename, and bind the outputs together.

Using base R:

files = list.files(pattern = "\\.csv$", full.names = TRUE)

new_data <- do.call(rbind, lapply(files, function(x) {
                  mydata <- read.csv(x, skip=1, header = TRUE)
                  # count each Type value and tag the counts with the source file
                  transform(as.data.frame(table(mydata$Type)), 
                            filename = basename(x))
            }))
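The same base R idea can also be written as an explicit loop, closer to the code in the question. This is only a sketch of one possible fix: in the original loop, `data` is never initialised before `rbind(data, datatobind)`, `as.numeric()` is applied to a prefix such as "1_dat" (which yields NA), and the final `do.call(rbind, data)` is not needed. The names `count_types()` and `combine_type_counts()` are made up for illustration, and the column names `Type` and `Freq` are a choice made here.

```r
# Count the values of the Type column in a single csv file.
# `count_types` is a hypothetical helper, not part of the original post.
count_types <- function(path) {
  mydata <- read.csv(path, skip = 1, header = TRUE)
  counts <- as.data.frame(table(mydata$Type), stringsAsFactors = FALSE)
  names(counts) <- c("Type", "Freq")
  counts$filename <- basename(path)  # identifier of the source file
  counts
}

# Loop over every csv file in `folder` and bind the per-file counts.
combine_type_counts <- function(folder = ".") {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  results <- vector("list", length(files))  # one slot per file
  for (i in seq_along(files)) {
    results[[i]] <- count_types(files[i])
  }
  do.call(rbind, results)  # bind once, after the loop
}
```

Calling `combined <- combine_type_counts("path/to/folder")` and then `write.csv(combined, file = "final.csv", row.names = FALSE)` would reproduce the last step of the question. Collecting the pieces in a list and binding them once avoids growing a data frame inside the loop.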

and with tidyverse:

library(dplyr)

new_data <- purrr::map_df(files, function(x) {
  mydata <- read.csv(x, skip=1, header = TRUE)
  mydata %>%
    count(Type) %>%
    mutate(filename = basename(x))
})
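In purrr 1.0.0 and later, `map_df()` is marked as superseded in favour of `map()` followed by `list_rbind()`. A minimal sketch of that variant is below, assuming purrr >= 1.0.0 and dplyr are installed. The function name `tidy_type_counts()` is hypothetical, and the filename column is produced here through `names_to` rather than `mutate()`.

```r
library(dplyr)

# Hypothetical helper: count Type per file and bind the results,
# using purrr::map() + purrr::list_rbind() instead of map_df().
tidy_type_counts <- function(folder = ".") {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  # Name each path by its file name so list_rbind() can use it as the id
  files <- purrr::set_names(files, basename(files))
  counts <- purrr::map(files, function(x) {
    read.csv(x, skip = 1, header = TRUE) %>%
      count(Type)
  })
  # names_to turns the list names into a "filename" column
  purrr::list_rbind(counts, names_to = "filename")
}
```

The result has the columns `filename`, `Type` and `n`, the same information as the `map_df()` version above.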

Here is a parallel version that suits your needs. You may need to install the doSNOW package; parallel already ships with R:

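library(doSNOW)    # also attaches foreach, which provides foreach() and %dopar%
library(parallel)

setwd("path/to/folder")

all_files = list.files(pattern = "\\.csv$")
num_files = length(all_files)

# Use about 90% of the cores, but at least one worker and no more workers than files
cl <- makeCluster(max(1, min(num_files, floor(detectCores()*0.9))), outfile = "")
registerDoSNOW(cl)
# Read each file on a worker and row-bind the results
dataset <- foreach(i=1:num_files, .combine='rbind') %dopar% 
{
  read.csv(all_files[i], header=TRUE)
}
stopCluster(cl)
registerDoSEQ()    # switch foreach back to sequential execution

write.csv(dataset, file='final.csv', row.names=FALSE)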

Tested on Windows 10 x64, where it was much faster than a regular loop.
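Note that the parallel code above binds the raw rows of every file; it does not skip the first line, count Type, or record which file each row came from. A hedged sketch that combines the parallel idea with the counting from the question is shown below, using only the `parallel` package that ships with R. The names `count_types()` and `parallel_type_counts()` and the default of two workers are assumptions made for illustration.

```r
library(parallel)

# Hypothetical helper: count the Type values of one csv file.
count_types <- function(path) {
  mydata <- utils::read.csv(path, skip = 1, header = TRUE)
  counts <- as.data.frame(table(mydata$Type), stringsAsFactors = FALSE)
  names(counts) <- c("Type", "Freq")
  counts$filename <- basename(path)  # identifier of the source file
  counts
}

# Hypothetical wrapper: run count_types() on a small cluster of workers.
parallel_type_counts <- function(folder = ".", workers = 2) {
  files <- list.files(folder, pattern = "\\.csv$", full.names = TRUE)
  if (length(files) == 0) return(NULL)
  # Never start more workers than files, and always at least one
  cl <- makeCluster(max(1, min(workers, length(files))))
  on.exit(stopCluster(cl), add = TRUE)  # stop the workers even on error
  # parLapply() sends count_types to the workers along with the file paths
  pieces <- parLapply(cl, files, count_types)
  do.call(rbind, pieces)
}
```

For a handful of small files, the cost of starting the workers can outweigh the gain, so whether this beats a plain `lapply()` depends on the number and size of the files.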
