Overwrite only some partitions in a partitioned Spark Dataset
How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and overwriting only the last week of data.
Default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written.
Since Spark 2.3.0 this is an option when overwriting a table. To use it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode must be overwrite. Example in Scala:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
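With dynamic overwrite enabled, only partitions that receive rows are rewritten, so pairing it with a repartition on the partition column keeps the file count per folder low. A minimal sketch, assuming the table is partitioned by a column named date (the column name is an assumption for illustration):

```scala
import org.apache.spark.sql.functions.col

// Shuffle rows so all rows with the same partition value land together,
// yielding roughly one file per partition folder instead of one per task.
data.repartition(col("date"))
  .write
  .mode("overwrite")
  .insertInto("partitioned_table")
```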
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
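A sketch of that pre-2.3.0 approach, assuming a Hive-style table partitioned by date (the partition value shown is a hypothetical example):

```scala
// Drop only the partitions that are about to be recomputed;
// untouched partitions are left intact.
spark.sql(
  "ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (date='2018-01-01')"
)

// Then append the freshly computed rows for those partitions.
newData.write.mode("append").insertInto("partitioned_table")
```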
Just FYI, for PySpark users: make sure to set overwrite=True in insertInto, otherwise the mode will be changed to append. From the source code:
def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode("overwrite" if overwrite else "append").insertInto(tableName)
Here is how to use it:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)
The SQL version also works fine:
INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement