Reading multiple CSV files in S3 and combining them into a single file when the file names differ, using R
Each day I have multiple CSV files with different names. I want to combine all the CSVs for a given day into a single file, and run that in a loop over the other days as well.
path = 's3://data/y=2017/m=05'
In m=05 I have around 200 CSV files with different names, and elsewhere, for example in m=06, I have 120 CSV files.
library(aws.s3)
root <- 's3://data/'  # bucket root; the y=/m=/d= parts are appended below
dates <- seq(as.Date('2017-05-05'), as.Date('2017-06-10'), "days")
for (i in seq_along(dates)) {
  dateofgen <- dates[i]  # the current day, not the whole vector
  filepath <- paste(root, "y=", format(dateofgen, '%Y'), "/m=", format(dateofgen, '%m'), "/d=", format(dateofgen, '%d'), "/part-00012-e731138c-232c-48b0-958f-55f2c72f3327-c000.csv", sep = '')
  # this reads one hard-coded part file per day, and `data` is overwritten on every iteration
  data <- s3read_using(read.csv, object = filepath, stringsAsFactors = FALSE, bucket = gsub("/.*", '', gsub("s3://", '', filepath)))
}
How can I read all the files for a day and combine them into a single file, using rbind or another merge function?
library(readxl)
library(dplyr)
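This gets the names of all .xls files in the given directory. list.files reads the local file system, so the S3 data has to be available locally (for example synced or mounted) for this to work. You can also match CSV files by changing the pattern.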
file.list <- list.files(path = 's3://data/y=2017/m=05', pattern = '\\.xlsx?$', full.names = TRUE)
This reads each file into a data frame and collects the data frames in a list.
df.list <- lapply(file.list, read_excel)
This takes every data frame in the list and binds all the rows together into one table.
tibble_of_your_xls_files <- bind_rows(df.list)
For your code I would run:
file.list <- list.files(path = 's3://data/y=2017/m=05', pattern = '\\.csv$', full.names = TRUE)
df.list <- lapply(file.list, read.csv, stringsAsFactors = FALSE)  # read.csv, because read_excel cannot parse CSV files
m052017.df <- bind_rows(df.list)
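The list, read, bind pattern above can be tried without any S3 access. The following is a minimal sketch that assumes the day's files are already in a local folder; the folder, the file names, and the columns are made up for illustration, and base R's rbind is used in place of bind_rows so that no extra package is needed.

```r
# Create a throwaway folder holding three differently named CSV files
# plus one non-CSV file that should be ignored (all names are hypothetical).
day_dir <- tempfile("day_")
dir.create(day_dir)
write.csv(data.frame(id = 1:2, value = c("a", "b")),
          file.path(day_dir, "part-00001.csv"), row.names = FALSE)
write.csv(data.frame(id = 3:4, value = c("c", "d")),
          file.path(day_dir, "export_x.csv"), row.names = FALSE)
write.csv(data.frame(id = 5:6, value = c("e", "f")),
          file.path(day_dir, "another-name.csv"), row.names = FALSE)
writeLines("not a csv", file.path(day_dir, "notes.txt"))

# List every CSV regardless of its name; full.names keeps the folder in the path.
csv_files <- list.files(day_dir, pattern = "\\.csv$", full.names = TRUE)

# Read each file into a data frame, then stack the rows into one table.
frames <- lapply(csv_files, read.csv, stringsAsFactors = FALSE)
combined <- do.call(rbind, frames)

print(dim(combined))
```

Because the files are selected by pattern rather than by name, the same code works whether a day has 120 files or 200. rbind requires all files to share the same columns; if they might not, dplyr's bind_rows is the more forgiving choice.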
We will use get_bucket_df to list the objects in the bucket, then use ldply to go through all the objects for each day of the month and read each S3 object with s3read_using().
library(aws.s3)
library(plyr)  # provides ldply

days <- as.character(17:31)
for (i in seq_along(days)) {  # loop only over the days listed above
  # key prefix inside the bucket, without the "s3://data/" part
  prefix <- paste0("y=2017/m=05/d=", days[i])
  temp_df <- get_bucket_df(bucket = "data", prefix = prefix)
  temp_df <- temp_df[grepl("\\.csv$", temp_df$Key), ]  # keep CSV objects only
  # read every CSV of the day; ldply row-binds the results into one data frame
  new_data <- ldply(temp_df$Key, function(x) {
    # the bucket in this URL must be the one that was listed above
    s3path <- paste0('s3://pa-datastore/', x)
    s3read_using(read.csv, na.strings = '', header = FALSE, object = s3path, stringsAsFactors = FALSE, bucket = gsub("/.*", '', gsub("s3://", '', s3path)))
  })
  dateofgen <- as.Date(paste0("2017-05-", days[i]))  # month matches the m=05 prefix being read
  new_path <- "s3://data/"
  filepath <- paste(new_path, "y=", format(dateofgen, '%Y'), "/m=", format(dateofgen, '%m'), "/newfile", dateofgen, ".csv", sep = '')
  # write the combined day back to S3 as a single CSV
  s3write_using(new_data, FUN = write.csv, row.names = FALSE, object = filepath, bucket = gsub("/.*", '', gsub("s3://", '', filepath)))
  print(paste0("completed for ", dateofgen))
}
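The per-day logic in the loop above (build a prefix, list the keys, keep the CSVs, read and stack them) can be separated from the S3 calls so that it can be checked offline. This is only a sketch: combine_prefix, list_keys, read_key, and the fake keys are hypothetical names introduced for illustration, and the y=/m=/d= layout is assumed from the question.

```r
# Hypothetical helper: combine every CSV found under one key prefix.
# `list_keys` returns the keys under a prefix and `read_key` reads one key;
# both are passed in so that the logic does not depend on a live bucket.
combine_prefix <- function(prefix, list_keys, read_key) {
  keys <- list_keys(prefix)
  keys <- keys[grepl("\\.csv$", keys)]    # ignore non-CSV objects
  if (length(keys) == 0) return(NULL)     # nothing to combine for this day
  do.call(rbind, lapply(keys, read_key))  # stack all files of the day
}

# One prefix per day, built directly from the dates.
dates <- seq(as.Date("2017-05-17"), as.Date("2017-05-19"), by = "day")
prefixes <- format(dates, "y=%Y/m=%m/d=%d/")

# Stand-ins for S3 (made-up keys and contents).
fake_keys <- c("y=2017/m=05/d=17/part-a.csv",
               "y=2017/m=05/d=17/part-b.csv",
               "y=2017/m=05/d=17/_SUCCESS",
               "y=2017/m=05/d=18/other-name.csv")
list_keys <- function(prefix) fake_keys[startsWith(fake_keys, prefix)]
read_key <- function(key) data.frame(key = key, x = 1:2, stringsAsFactors = FALSE)

# One combined data frame per day (NULL where a day has no CSV files).
daily <- lapply(prefixes, combine_prefix, list_keys = list_keys, read_key = read_key)
names(daily) <- format(dates)

print(sapply(daily, NROW))
```

Against a real bucket, list_keys could wrap get_bucket_df(bucket, prefix = prefix)$Key and read_key could wrap s3read_using(read.csv, object = key, bucket = bucket), as in the answer above. Building the prefixes from a date sequence also avoids hard-coding day numbers and handles month boundaries.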