Overwrite only some partitions in a partitioned Spark Dataset
How can we overwrite a partitioned dataset, but only the partitions we are going to change? For example, recomputing last week's daily job, and overwriting only the last week of data.
Default Spark behaviour is to overwrite the whole table, even if only some partitions are going to be written.
Since Spark 2.3.0 this is an option when overwriting a table. To use it, set the new spark.sql.sources.partitionOverwriteMode setting to dynamic, the dataset needs to be partitioned, and the write mode must be overwrite. Example in Scala:
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
data.write.mode("overwrite").insertInto("partitioned_table")
I recommend doing a repartition based on your partition column before writing, so you won't end up with 400 files per folder.
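With dynamic overwrite enabled, only partitions that receive rows are rewritten, so pairing it with a repartition on the partition column keeps the file count per folder low. A minimal sketch, assuming the table is partitioned by a column named date (the column name is an assumption for illustration):

```scala
import org.apache.spark.sql.functions.col

// Shuffle rows so all rows with the same partition value land together,
// yielding roughly one file per partition folder instead of one per task.
data.repartition(col("date"))
  .write
  .mode("overwrite")
  .insertInto("partitioned_table")
```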
Before Spark 2.3.0, the best solution would be to launch SQL statements to delete those partitions and then write them with mode append.
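A sketch of that pre-2.3.0 approach, assuming a Hive-style table partitioned by date (the partition value shown is a hypothetical example):

```scala
// Drop only the partitions that are about to be recomputed;
// untouched partitions are left intact.
spark.sql(
  "ALTER TABLE partitioned_table DROP IF EXISTS PARTITION (date='2018-01-01')"
)

// Then append the freshly computed rows for those partitions.
newData.write.mode("append").insertInto("partitioned_table")
```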
Just FYI, for PySpark users: make sure to set overwrite=True in insertInto, otherwise the mode will be changed to append. From the source code:
def insertInto(self, tableName, overwrite=False):
    self._jwrite.mode("overwrite" if overwrite else "append").insertInto(tableName)
Here is how to use it:
spark.conf.set("spark.sql.sources.partitionOverwriteMode","DYNAMIC")
data.write.insertInto("partitioned_table", overwrite=True)
The SQL version also works fine:
INSERT OVERWRITE TABLE [db_name.]table_name [PARTITION part_spec] select_statement