Merge multiple CSV files into one CSV file in Azure Blob Storage using PySpark
I am using the code below to save CSV files back to blob storage. Because it runs in a loop, it creates multiple files, and I would now like to merge them into a single CSV file. I have tried dbutils.fs.cp/mv, but it did not help.
while start_date <= end_date:
    df = spark.read.format("com.databricks.spark.csv").options(header="true", inferschema="true").load(inputFilePath)
    df.coalesce(1).write.mode("append").option("header", "true").format("com.databricks.spark.csv").save(TargetPath)
A similar request has been posted here, but it was done with a pandas data frame, and I am looking for something that uses a Spark dataframe: "Copy data from multiple csv files into one csv file".
My suggestion would be: use the while loop only to build the list of CSV files to read, then let the Spark CSV reader read them all in a single pass. For example:
files = []
while start_date <= end_date:
    files.append(inputFilePath)
    # advance start_date (and inputFilePath) here, as in your original loop

# one read over all files, instead of one read per loop iteration
df = spark.read.options(header="true", inferSchema="true").csv(files)
df.coalesce(1).write.mode("append").option("header", "true").csv(TargetPath)
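For a complete picture, here is a minimal, self-contained sketch of that approach. It assumes daily input files whose paths embed the date; the storage account, container, path pattern, and date range are all placeholders, so adjust them to your setup:

from datetime import date, timedelta

start_date = date(2020, 1, 1)    # placeholder date range
end_date = date(2020, 1, 7)

# Build the full list of input paths first, instead of reading inside the loop
files = []
while start_date <= end_date:
    # assumed naming convention: one CSV per day, stamped with the date
    files.append(f"wasbs://<container>@<account>.blob.core.windows.net/input/{start_date:%Y-%m-%d}.csv")
    start_date += timedelta(days=1)

# One read over all files, one write with a single output partition
df = spark.read.options(header="true", inferSchema="true").csv(files)
df.coalesce(1).write.mode("overwrite").option("header", "true").csv(TargetPath)

Note that even with coalesce(1), Spark writes a directory at TargetPath containing a single part-*.csv file (plus metadata files such as _SUCCESS). If you need one plainly named file, you can move that part file afterwards with dbutils.fs on Databricks; the output name merged.csv below is just an example:

part = [f.path for f in dbutils.fs.ls(TargetPath) if f.name.startswith("part-")][0]
dbutils.fs.mv(part, TargetPath.rstrip("/") + "/merged.csv")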