Spark: how to overwrite files in an S3 folder rather than the whole folder

Question

Using Spark, I am trying to push some data (in CSV and Parquet format) to an S3 bucket.

df.write.mode("OVERWRITE").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)

Answer 1

Short answer: set the Spark configuration parameter spark.sql.sources.partitionOverwriteMode to dynamic instead of static. With this setting, an overwrite replaces only the partitions that the incoming data writes to, not all of them. PySpark example:

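from pyspark import SparkConf, SparkContext, sql

# Enable dynamic partition overwrite before the context is created.
conf = SparkConf().setAppName("test").set("spark.sql.sources.partitionOverwriteMode", "dynamic").setMaster("yarn")
sc = SparkContext(conf=conf)
sqlContext = sql.SQLContext(sc)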
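The setting above only changes behaviour for partitioned writes, so it may help to see it combined with partitionBy. The following is a minimal sketch, not the answerer's code: the function name overwrite_touched_partitions, its parameters, and the use of Parquet as the default format are assumptions made for illustration. It expects an existing SparkSession and DataFrame to be passed in.

```python
def overwrite_touched_partitions(spark, df, destination_path, partition_cols, fmt="parquet"):
    """Overwrite only the partitions present in df (hypothetical helper).

    spark: an existing SparkSession
    df: the DataFrame to write
    destination_path: target folder, e.g. an s3a:// URI
    partition_cols: column names used to lay out the subfolders
    """
    # Switch from the default "static" mode to "dynamic" for this session.
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

    # With dynamic mode, "overwrite" replaces only the partition folders
    # that df actually contains rows for; other partitions stay in place.
    (
        df.write
        .mode("overwrite")
        .format(fmt)
        .partitionBy(*partition_cols)
        .save(destination_path)
    )
```

A call might look like overwrite_touched_partitions(spark, df, destination_path, ["event_date"]), where event_date is a placeholder column name. Without partitionBy, an overwrite still replaces the whole destination folder regardless of this setting.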

Answer 2

The files can be deleted first, and the data then written in append mode rather than overwrite mode, so that the subfolders are retained. Below is a PySpark example.

import subprocess
subprocess.call(["hadoop", "fs", "-rm", "{}*.csv.deflate".format(destination_path)])

df.write.mode("append").format("com.databricks.spark.csv").options(nullValue=options['nullValue'], header=options['header'], delimiter=options['delimiter'], quote=options['quote'], escape=options['escape']).save(destination_path)
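The delete-then-append approach can be wrapped so that the write is skipped if the delete fails. This is a hedged sketch rather than the answerer's code: build_rm_command, delete_then_append, the run parameter, and the built-in "csv" format are assumptions introduced for illustration, and the -f flag is added so that hadoop fs -rm does not fail when no file matches the pattern.

```python
import subprocess


def build_rm_command(destination_path, pattern="*.csv.deflate"):
    """Build the hadoop command that deletes matching files (hypothetical helper)."""
    # Normalise the path so there is exactly one slash before the pattern.
    target = destination_path.rstrip("/") + "/" + pattern
    # The glob is passed through unexpanded; hadoop fs expands it itself.
    return ["hadoop", "fs", "-rm", "-f", target]


def delete_then_append(df, destination_path, pattern="*.csv.deflate", run=subprocess.call):
    """Delete old output files, then append new data, keeping subfolders."""
    return_code = run(build_rm_command(destination_path, pattern))
    if return_code != 0:
        # Stop here so stale and new files are not mixed after a failed delete.
        raise RuntimeError("hadoop fs -rm exited with code {}".format(return_code))

    # Append mode adds new files without removing the folder or its subfolders.
    df.write.mode("append").format("csv").save(destination_path)
```

The run parameter defaults to subprocess.call and exists only so the delete step can be replaced in tests. Note that the delete and the append are two separate steps, so readers of the folder may briefly see it without the old files.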


Answer 1
3 2019-09-04 23:54:20

Answer 2
0 2022-01-26 03:44:44
