
Overwrite only some partitions in a partitioned Spark Dataset

How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, when recomputing last week's daily jobs, we want to overwrite only last week's data.

Spark's default behaviour is to overwrite the whole table, even if only some partitions are going to be written.

Since Spark 2.3.0 this is an option when overwriting a table. To enable it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic; the dataset needs to be partitioned, and the write mode must be overwrite. Example in Scala:

spark.conf.set(
  "spark.sql.sources.partitionOverwriteMode", "dynamic"
)
data.write.mode("overwrite").insertInto("partitioned_table")

I recommend repartitioning on your partition column before writing, so you don't end up with 400 files per folder.
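A minimal PySpark sketch of the same write with the repartition step included. The helper name overwrite_changed_partitions and its parameters are made up for illustration; it assumes spark is an active SparkSession, df is the DataFrame to write, and the target table already exists and is partitioned by partition_col.

```python
def overwrite_changed_partitions(spark, df, table, partition_col):
    """Overwrite only the partitions of `table` that receive rows from `df`.

    Hypothetical helper: `spark` is assumed to be a SparkSession and `df`
    a DataFrame whose columns are in the same order as the table's.
    """
    # Replace only the partitions being written (available since Spark 2.3.0).
    spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
    # Shuffle rows by partition value so that each partition folder is
    # written by a single task rather than by every upstream task.
    repartitioned = df.repartition(partition_col)
    # insertInto matches columns by position, not by name, so the
    # partition column(s) must come last in `df`.
    repartitioned.write.insertInto(table, overwrite=True)
```

It could be called as overwrite_changed_partitions(spark, data, "partitioned_table", "date"), where "date" is an assumed partition column name.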

Before Spark 2.3.0, the best solution was to run SQL statements to delete those partitions and then write them with mode append.
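As a rough sketch of that pre-2.3.0 approach, the statements to drop the affected partitions could be generated like this. The helper drop_partition_statements is hypothetical, the partition column date is an assumption, and the quoting is naive (every value is treated as a string literal), so treat it as an illustration only.

```python
def drop_partition_statements(table, partitions):
    """Build one ALTER TABLE ... DROP IF EXISTS PARTITION statement per spec.

    `partitions` is a list of dicts mapping partition column -> value.
    Hypothetical helper; values are naively quoted as string literals.
    """
    statements = []
    for spec in partitions:
        clause = ", ".join(f"{col}='{val}'" for col, val in spec.items())
        statements.append(
            f"ALTER TABLE {table} DROP IF EXISTS PARTITION ({clause})"
        )
    return statements


# Intended use (needs a live SparkSession, so left as comments):
#   for stmt in drop_partition_statements("partitioned_table", specs):
#       spark.sql(stmt)
#   data.write.mode("append").insertInto("partitioned_table")
```

Dropping first and appending afterwards is not atomic, so a reader may briefly see the partitions missing between the two steps.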

Just FYI, PySpark users should make sure to set overwrite=True in insertInto; otherwise the mode is changed to append.

From the source code:

def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode(
        "overwrite" if overwrite else "append"
    ).insertInto(tableName)

This is how to use it:

spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)

The SQL version also works fine:

INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement

For details, see the documentation.
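To make the INSERT OVERWRITE template above concrete, here is a small sketch that fills it in for a single static partition. The helper insert_overwrite_partition, the date column, and the staging table in the usage note are all assumptions for illustration. With a static partition spec like this, the select statement should not return the partition column itself.

```python
def insert_overwrite_partition(table, partition_col, value, select_statement):
    """Fill in the INSERT OVERWRITE template for one static partition.

    Hypothetical helper; the value is naively quoted as a string literal.
    """
    return (
        f"INSERT OVERWRITE TABLE {table} "
        f"PARTITION ({partition_col}='{value}') "
        f"{select_statement}"
    )


# Intended use (needs a live SparkSession, so left as a comment):
#   spark.sql(insert_overwrite_partition(
#       "partitioned_table", "date", "2018-01-01",
#       "SELECT id, value FROM staging WHERE date = '2018-01-01'"))
```

Because the partition is named explicitly, only that partition is replaced by this statement.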
