
Spark creates extra partition column in S3

I am writing a dataframe to S3 as shown below. Target location: s3://test/folder

    val targetDf = spark.read.schema(schema).parquet(targetLocation)
    // sourceDf and targetDf are assumed to be registered as temp views for the SQL below
    val df1 = spark.sql("select * from sourceDf")
    val df2 = spark.sql("select * from targetDf")
    /*
    for loop over a date range to dedup and write the data to s3;
    union dfs and run a dedup logic (dedup code and for loop omitted)
    */
    val df3 = spark.sql("select * from df1 union all select * from df2")
    df3.write.partitionBy("data_id", "schedule_dt").parquet(targetLocation)

Spark is creating an extra partition column on write, as shown below:

Exception in thread "main" java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

Partition column name list #0: data_id, schedule_dt
Partition column name list #1: data_id, schedule_dt, schedule_dt

The EMR optimizer class is enabled while writing, and I am using Spark 2.4.3. Please let me know what could be causing this error.

Thanks, Abhineet
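For context, this assertion is raised by Spark's partition discovery when the leaf directories under a base path disagree on the set of partition columns, for example when some files sit under .../data_id=.../schedule_dt=.../ and others under .../data_id=.../schedule_dt=.../schedule_dt=.../. Below is a minimal sketch to inspect the layout under the target path; the path s3://test/folder is taken from the question, and the listing uses the standard Hadoop FileSystem API:

    import java.net.URI
    import org.apache.hadoop.fs.{FileSystem, Path}

    // Path taken from the question; adjust to the real bucket/prefix.
    val targetLocation = "s3://test/folder"
    val fs = FileSystem.get(new URI(targetLocation), spark.sparkContext.hadoopConfiguration)

    // Recursively list files and print their parent directories so a nested
    // ".../schedule_dt=.../schedule_dt=..." layout becomes visible.
    val files = fs.listFiles(new Path(targetLocation), true)
    while (files.hasNext) {
      println(files.next().getPath.getParent)
    }

If duplicated schedule_dt= directories show up, the inconsistent layout already present under the target path is likely what trips the assertion, and those prefixes need to be cleaned up or rewritten before the next write.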

You should select at least one extra column apart from the partition columns. Can you please try

val df3=df1.union(df2)

instead of

val df3=spark.sql("select data_id,schedule_dt from df1 union all select data_id,schedule_dt from df2")
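Along the same lines, here is a minimal sketch of the corrected write, assuming sourceDf and targetDf are the full DataFrames from the question rather than just the partition columns; unionByName (available since Spark 2.3) is used as a hedge against positional column mismatches, and the partition column names are passed as quoted strings:

    // Union by column name so a positional mismatch between the two frames
    // cannot introduce a duplicate schedule_dt column.
    val merged = sourceDf.unionByName(targetDf)

    merged.write
      .partitionBy("data_id", "schedule_dt")  // partition column names must be string literals
      .parquet(targetLocation)                // write to the path variable, not the literal "targetLocation"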
