
Reading multiple CSV files from S3 and combining them into a single file when the file names differ, using R

Each day I have multiple CSV files with different names, and I want to combine all of a day's CSVs into a single file, then put that in a loop to do the same for the other days.

   path = 's3://data/y=2017/m=05'

Under m=05 I have multiple CSV files (around 200) with different names; other months vary too, e.g. under m=06 I have 120 CSV files.

    dates <- seq(as.Date('2017-05-05'), as.Date('2017-06-10'), "days")
    for (i in 1:length(dates)) {
      dateofgen <- dates[i]   # index into the sequence, not the whole vector
      filepath <- paste(path, "y=", format(dateofgen, '%Y'),
                        "/m=", format(dateofgen, '%m'),
                        "/d=", format(dateofgen, '%d'),
                        "/part-00012-e731138c-232c-48b0-958f-55f2c72f3327-c000.csv",
                        sep = '')
      # Note: this overwrites `data` on every iteration and only reads one
      # hard-coded part file per day.
      data <- s3read_using(read.csv, object = filepath, stringsAsFactors = FALSE,
                           bucket = gsub("/.*", '', gsub("s3://", '', filepath)))
    }

How can I read and combine all files for a day into a single file using rbind() or any merge function?

    library(readxl)
    library(dplyr)

This gets the names of all .xls files in your working directory; you can also use the pattern '*.csv'. (Note that `list.files()` only works on a local filesystem path, not an `s3://` URL, so the objects would need to be downloaded or synced locally first.)

    file.list <- list.files(path = 's3://data/ y= 2017 /m= 05', pattern='*.xls')

This reads each file into a list of data frames.

    df.list <- lapply(file.list, read_excel)

This binds all rows of every data frame in the list into one data frame.

    tibble_of_your_xls_files <- bind_rows(df.list)

For your code I would run:

    file.list <- list.files(path = 's3://data/ y= 2017 /m= 05', pattern = '*.csv',
                            full.names = TRUE)
    df.list <- lapply(file.list, read.csv, stringsAsFactors = FALSE)
    m052017.df <- bind_rows(df.list)
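Since `list.files()` only sees local paths, here is a minimal local sketch of the same pattern, assuming one day's part files have first been copied into a directory (the directory and file names below are made up for illustration):

```r
library(dplyr)

# Hypothetical local stand-in for one day's worth of part files.
day_dir <- file.path(tempdir(), "d=17")
dir.create(day_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:2, v = c("a", "b")),
          file.path(day_dir, "part-1.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, v = c("c", "d")),
          file.path(day_dir, "part-2.csv"), row.names = FALSE)

# full.names = TRUE is needed so read.csv() receives a usable path.
file.list <- list.files(path = day_dir, pattern = "\\.csv$", full.names = TRUE)
df.list   <- lapply(file.list, read.csv, stringsAsFactors = FALSE)
combined  <- bind_rows(df.list)   # one data frame, one row block per file
```

`bind_rows()` also fills missing columns with `NA` if the part files do not share an identical schema, which plain `rbind()` would refuse to do.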

We will use the `get_bucket_df()` function to list the objects in the bucket, then use `ldply()` to iterate over all objects under each day's prefix and read each S3 object with `s3read_using()`.

    library(aws.s3)   # get_bucket_df(), s3read_using(), s3write_using()
    library(plyr)     # ldply()

    days <- sprintf("%02d", 17:31)

    for (i in seq_along(days)) {
      # prefix is relative to the bucket, without the s3:// scheme
      prefix <- paste0("y=2017/m=05/d=", days[i])
      temp_df <- get_bucket_df(bucket = "data", prefix = prefix)
      temp_df <- temp_df[grepl("\\.csv$", temp_df$Key), ]
      # read every CSV object for the day and stack the rows
      new_data <- ldply(temp_df$Key, function(x) {
        s3path <- paste0("s3://data/", x)
        s3read_using(read.csv, na.strings = '', header = FALSE,
                     object = s3path, stringsAsFactors = FALSE,
                     bucket = gsub("/.*", '', gsub("s3://", '', s3path)))
      })
      # write the combined file back under the month's prefix
      dateofgen <- as.Date(paste0("2017-05-", days[i]))
      filepath <- paste0("s3://data/", "y=", format(dateofgen, '%Y'),
                         "/m=", format(dateofgen, '%m'),
                         "/newfile", dateofgen, ".csv")
      s3write_using(new_data, FUN = write.csv, row.names = FALSE,
                    object = filepath,
                    bucket = gsub("/.*", '', gsub("s3://", '', filepath)))
      print(paste0("completed for ", dateofgen))
    }
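If the day folders always correspond to calendar dates, the hand-typed `days` vector can instead be derived from a date sequence, so the loop can never index past the end of the vector. A minimal sketch, assuming the `y=/m=/d=` key layout from the question:

```r
# Build per-day S3 key prefixes from a date sequence instead of a
# hard-coded character vector.
dates <- seq(as.Date("2017-05-17"), as.Date("2017-05-31"), by = "day")
prefixes <- sprintf("y=%s/m=%s/d=%s",
                    format(dates, "%Y"), format(dates, "%m"), format(dates, "%d"))
prefixes[1]   # "y=2017/m=05/d=17"
```

Each element of `prefixes` can then be passed straight to `get_bucket_df(bucket = "data", prefix = ...)` inside the loop.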
