How to loop through a folder of CSV files in R

I have a folder containing a bunch of CSV files that are titled "yob1980", "yob1981", "yob1982" etc.

I have to use a for loop to go through each file and put its contents into a data frame - the columns in the data frame should be "1980", "1981", "1982" etc

Here is what I have:

file_list <- list.files()

temp = list.files(pattern="*.txt")
babynames <- do.call(rbind,lapply(temp,read.csv, FALSE))

names(babynames) <- c("Name", "Gender", "Count")

I feel like I need a for loop, but I'm not sure how to loop through the files. Can anyone point me in the right direction?

Consider an anonymous function within lapply():

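# "\\.txt$" is a regular expression matching names that end in ".txt"
files <- list.files(pattern = "\\.txt$")

dfList <- lapply(files, function(i) {
     df <- read.csv(i, header = FALSE, col.names = c("Name", "Gender", "Count"))
     # strip both the "yob" prefix and the ".txt" extension, leaving the year
     df$Year <- gsub("yob|\\.txt", "", i)
     return(df)
})

# stack the per-file data frames into one
finaldf <- do.call(rbind, dfList)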

My favourite way to do this is using ldply from the plyr package. It has the advantage of returning a dataframe, so you don't need to do the rbind step afterwards:

library( plyr )
babynames <- ldply( .data = list.files(pattern="\\.txt$"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count") )

As an added benefit, you can parallelize the import very easily (doMC relies on forked processes, so this applies to Unix-like systems), which can make importing large multi-file datasets quite a bit faster:

library( plyr )
library( doMC )
registerDoMC( cores = 4 )
babynames <- ldply( .data = list.files(pattern="\\.txt$"),
                    .fun = read.csv,
                    header = FALSE,
                    col.names=c("Name", "Gender", "Count"),
                    .parallel = TRUE )

Changing the above slightly to include a Year column in the resulting data frame, you can create a function first, then execute that function within ldply in the same way you would execute read.csv:

readFun <- function( filename ) {

    # read in the data
    data <- read.csv( filename, 
                      header = FALSE, 
                      col.names = c( "Name", "Gender", "Count" ) )

    # add a "Year" column by removing both "yob" and ".txt" from file name
    # (the dot is escaped so it matches a literal "." rather than any character)
    data$Year <- gsub( "yob|\\.txt", "", filename )

    return( data )
}

# execute that function across all files, outputting a data frame
doMC::registerDoMC( cores = 4 )
babynames <- plyr::ldply( .data = list.files(pattern="\\.txt$"),
                          .fun = readFun,
                          .parallel = TRUE )

This gives you the data in a concise, tidy (long) format, which is how I'd recommend moving forward from here. While it is possible to then separate each year's data into its own column, it's likely not the best way to go.
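
If the one-column-per-year layout is still required, a minimal base-R sketch of the reshape could look like the following. The small data frame named long is made up for illustration and stands in for the combined babynames; reshape() names the new columns "Count.1980", "Count.1981" and so on, so the prefix is stripped afterwards.

```r
# Hypothetical stand-in for the combined long-format data (values are made up).
long <- data.frame(
  Name   = c("Mary", "John", "Mary", "John"),
  Gender = c("F", "M", "F", "M"),
  Count  = c(100L, 90L, 95L, 85L),
  Year   = c("1980", "1980", "1981", "1981"),
  stringsAsFactors = FALSE
)

# One row per Name/Gender pair, one Count column per distinct Year.
wide <- reshape(long,
                idvar     = c("Name", "Gender"),
                timevar   = "Year",
                direction = "wide")

# Drop the "Count." prefix so the columns are named by year alone.
names(wide) <- sub("^Count\\.", "", names(wide))

print(wide)
```

A name that is absent from a given year's file ends up as NA in that year's column, which is one reason the long format tends to be easier to work with.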

Note: depending on your preference, it may be a good idea to convert the Year column to, say, integer class. But that's up to you.
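
A minimal sketch of that conversion, using a small made-up data frame in place of the real babynames (the rows are illustrative, not taken from the actual files):

```r
# Hypothetical stand-in for the combined data; Year is character because it
# was extracted from the file name.
babynames <- data.frame(
  Name   = c("Mary", "John", "Mary"),
  Gender = c("F", "M", "F"),
  Count  = c(100L, 90L, 95L),
  Year   = c("1980", "1980", "1981"),
  stringsAsFactors = FALSE
)

# as.integer() parses the digit strings; a non-numeric string would become NA
# (with a warning), which is a quick way to spot a badly named file.
babynames$Year <- as.integer(babynames$Year)

str(babynames$Year)
```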

Using purrr

library(tidyverse)

# full.names = TRUE returns paths that already include the directory
files <- list.files(path = "./data", pattern = "\\.csv$", full.names = TRUE)

df <- files %>% 
    map(read.csv) %>%
    reduce(rbind)

A for loop might be more appropriate than lapply in this case.

file_list <- list.files(pattern = "\\.txt$")
data_list <- vector("list", length = length(file_list))

for (i in seq_along(file_list)) {
    filename = file_list[[i]]

    # Read data in
    df <- read.csv(filename, header = FALSE, col.names = c("Name", "Gender", "Count"))

    # Extract year from filename by dropping the "yob" prefix and ".txt" extension
    year <- gsub("yob|\\.txt", "", filename)
    df[["Year"]] <- year

    # Add this file's data frame to data_list
    data_list[[i]] <- df
}

babynames <- do.call(rbind, data_list)
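
To try the loop without the real files, here is a small self-contained sketch that writes two made-up yob files to a temporary directory and reads them back. The directory name yob_demo, the sample rows, the Year column name, and the integer conversion are all illustrative assumptions.

```r
# Write two tiny made-up files into a temporary directory.
tmp <- file.path(tempdir(), "yob_demo")
dir.create(tmp, showWarnings = FALSE)
writeLines(c("Mary,F,100", "John,M,90"), file.path(tmp, "yob1980.txt"))
writeLines(c("Mary,F,95", "John,M,85"), file.path(tmp, "yob1981.txt"))

# full.names = TRUE returns paths usable from any working directory.
file_list <- list.files(tmp, pattern = "^yob[0-9]{4}\\.txt$", full.names = TRUE)
data_list <- vector("list", length(file_list))

for (i in seq_along(file_list)) {
  df <- read.csv(file_list[[i]], header = FALSE,
                 col.names = c("Name", "Gender", "Count"))
  # basename() drops the directory; the capture group keeps only the four digits.
  year <- sub("^yob([0-9]{4})\\.txt$", "\\1", basename(file_list[[i]]))
  df$Year <- as.integer(year)
  data_list[[i]] <- df
}

# Stack the per-file data frames into one long data frame.
babynames <- do.call(rbind, data_list)
print(babynames)
```

Preallocating data_list and binding once at the end avoids growing a data frame inside the loop, which gets slow as the number of files increases.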
