
How to loop through a folder of CSV files in R

I have a folder containing a bunch of CSV files that are titled "yob1980", "yob1981", "yob1982", etc.

I have to use a for loop to go through each file and put its contents into a data frame. The columns in the data frame should be "1980", "1981", "1982", etc.

Here is what I have:

file_list <- list.files()

temp = list.files(pattern="*.txt")
babynames <- do.call(rbind,lapply(temp,read.csv, FALSE))

names(babynames) <- c("Name", "Gender", "Count")

I feel like I need a for loop, but I'm not sure how to loop through the files. Can anyone point me in the right direction?

Consider an anonymous function within lapply():

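files <- list.files(pattern="*.txt")

dfList <- lapply(files, function(i) {
     # read one file; the files have no header row
     df <- read.csv(i, header=FALSE, col.names=c("Name", "Gender", "Count"))
     # derive the year from the file name, e.g. "yob1980.txt" -> "1980"
     df$Year <- gsub("yob|\\.txt$", "", i)
     return(df)
})

# stack the per-file data frames into one
finaldf <- do.call(rbind, dfList)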
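
One detail about the pattern argument: list.files() interprets it as a regular expression rather than a shell glob, so "*.txt" can match more than files ending in ".txt". A minimal sketch of two stricter alternatives, assuming a scratch directory and made-up file names (both hypothetical, created only for this illustration):

```r
# Create a scratch directory with hypothetical file names for illustration
tmp <- file.path(tempdir(), "yob_pattern_demo")
dir.create(tmp, showWarnings = FALSE)
invisible(file.create(file.path(tmp, c("yob1980.txt", "yob1981.txt", "notes_txt.md"))))

# Anchored regular expression: "yob", four digits, then a literal ".txt" at the end
strict <- list.files(tmp, pattern = "^yob[0-9]{4}\\.txt$")

# glob2rx() (from utils) converts a shell-style glob into a regular expression
globbed <- list.files(tmp, pattern = glob2rx("*.txt"))

print(strict)
print(globbed)
```

Either form can be dropped into the list.files() calls used in the answers here; the anchored version also guards against unrelated ".txt" files in the same folder.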

My favourite way to do this is using ldply from the plyr package. It has the advantage of returning a data frame, so you don't need to do the rbind step afterwards:

library( plyr )
babynames <- ldply( .data = list.files(pattern="*.txt"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count") )

As an added benefit, you can parallelize the import very easily, which can make importing large multi-file datasets quite a bit faster:

library( plyr )
library( doMC )
registerDoMC( cores = 4 )
babynames <- ldply( .data = list.files(pattern="*.txt"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count"),
                    .parallel = TRUE )

Changing the above slightly to include a Year column in the resulting data frame, you can create a function first, then execute that function within ldply in the same way you would execute read.csv:

readFun <- function( filename ) {

    # read in the data
    data <- read.csv( filename, 
                      header = FALSE, 
                      col.names = c( "Name", "Gender", "Count" ) )

    # add a "Year" column by removing both "yob" and ".txt" from file name
    data$Year <- gsub( "yob|\\.txt", "", filename )

    return( data )
}

# execute that function across all files, outputting a data frame
doMC::registerDoMC( cores = 4 )
babynames <- plyr::ldply( .data = list.files(pattern="*.txt"),
                          .fun = readFun,
                          .parallel = TRUE )

This will give you your data in a concise and tidy way, which is how I'd recommend moving forward from here. While it is possible to then separate each year's data into its own column, it's likely not the best way to go.
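
If one column per year is really required, as the question asks, the long data frame can be pivoted afterwards. A rough sketch using base R's reshape(), with a small made-up data frame standing in for the combined babynames result (the names and counts are invented for illustration):

```r
# Hypothetical long-format data, shaped like the combined babynames data frame
babynames <- data.frame(
  Name   = c("Mary", "John", "Mary", "John"),
  Gender = c("F", "M", "F", "M"),
  Count  = c(10L, 12L, 8L, 15L),
  Year   = c("1980", "1980", "1981", "1981"),
  stringsAsFactors = FALSE
)

# Pivot to one Count column per year; each Name/Gender pair becomes one row
wide <- reshape(babynames,
                idvar     = c("Name", "Gender"),
                timevar   = "Year",
                direction = "wide")

print(wide)
```

The resulting columns are named after the value column and the year (for example Count.1980), and a name that is missing in some year would get NA there.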

Note: depending on your preference, it may be a good idea to convert the Year column to, say, integer class. But that's up to you.
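
As a small illustration of that conversion, assuming file names of the form "yobYYYY.txt" (the vector below is a hypothetical stand-in for the output of list.files()):

```r
# Hypothetical file names, standing in for the output of list.files()
files <- c("yob1980.txt", "yob1981.txt")

# Strip the "yob" prefix and the ".txt" suffix, then convert to integer
years <- as.integer(gsub("^yob|\\.txt$", "", files))

str(years)
```

The same as.integer() call could be applied to an existing Year column once the data frame has been built.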

Using purrr:

library(tidyverse)

files <- list.files(path = "./data/", pattern = "*.csv")

df <- files %>% 
    map(function(x) {
        read.csv(paste0("./data/", x))
    }) %>%
    reduce(rbind)

A for loop might be more appropriate than lapply in this case.

file_list = list.files(pattern="*.txt")
data_list <- vector("list", length = length(file_list))

for (i in seq_along(file_list)) {
    filename = file_list[[i]]

    # Read data in
    df <- read.csv(filename, header = FALSE, col.names = c("Name", "Gender", "Count"))

    # Extract year from filename, e.g. "yob1980.txt" -> "1980"
    year = gsub("yob|\\.txt$", "", filename)
    df[["Year"]] = year

    # Add year to data_list
    data_list[[i]] <- df
}

babynames <- do.call(rbind, data_list)
