Spark Streaming + Hudi: HoodieException: Config conflict (key | current value | existing value): RecordKey
When I use Spark to connect to a Kafka topic, create a DataFrame, and then write it to Hudi:
df
.selectExpr("key", "topic", "partition", "offset", "timestamp", "timestampType", "CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("hudi")
.options(getQuickstartWriteConfigs)
.option(PRECOMBINE_FIELD.key(), "essDateTime")
.option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator")
.option(RECORDKEY_FIELD.key(), "offset,timestamp") // previously "offset,essDateTime"
.option(TBL_NAME.key, streamingTableName)
.option("path", baseStreamingPath)
.trigger(ProcessingTime(10000))
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
I get the following exception:
[ERROR] 2023-01-31 09:35:25.474 [stream execution thread for [id = 8b30fd4b-8506-490b-80ad-76868c14594f, runId = 25d34e6f-10e2-42c2-b094-654797f5d79c]] HoodieStreamingSink - Micro batch id=1 threw following exception:
org.apache.hudi.exception.HoodieException: Config conflict(key current value existing value):
RecordKey: offset,timestamp uuid
KeyGenerator: org.apache.hudi.keygen.ComplexKeyGenerator org.apache.hudi.keygen.SimpleKeyGenerator
at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:167) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:90) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:129) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at scala.util.Try$.apply(Try.scala:213) ~[scala-library-2.12.15.jar:?]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:128) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:214) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:127) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:666) ~[spark-sql_2.12-3.3.1.jar:3.3.1]
The goal is to store all of the Kafka data in a Hudi table.
In Apache Hudi, some table configurations cannot be overridden once the table exists, and the KeyGenerator is one of them. It looks like the table was originally created with org.apache.hudi.keygen.SimpleKeyGenerator (and record key uuid, per the error message), so you need to recreate the table in order to change this configuration and the record key. If you want a quick test, you can change baseStreamingPath and write the data into a new Hudi table.
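A minimal sketch of that quick test, reusing the streaming query from the question but pointing it at a fresh table path (the path values and newTableName here are hypothetical; a new checkpointLocation is also used, since reusing the old checkpoint would resume the old query state):

```scala
// Assumption: df, getQuickstartWriteConfigs, PRECOMBINE_FIELD, RECORDKEY_FIELD,
// and TBL_NAME are already in scope as in the question's snippet.
import org.apache.spark.sql.streaming.Trigger.ProcessingTime

// Hypothetical fresh locations: an empty path means no existing
// hoodie.properties, so no config-conflict validation can fail.
val newBaseStreamingPath  = "/tmp/hudi/kafka_events_v2"
val newCheckpointLocation = "/tmp/checkpoints/kafka_events_v2"
val newTableName          = "kafka_events_v2"

df
  .selectExpr("offset", "timestamp", "essDateTime",
              "CAST(key AS STRING) AS key_str", "CAST(value AS STRING) AS value_str")
  .writeStream
  .format("hudi")
  .options(getQuickstartWriteConfigs)
  .option(PRECOMBINE_FIELD.key(), "essDateTime")
  .option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.ComplexKeyGenerator") // takes effect: new table
  .option(RECORDKEY_FIELD.key(), "offset,timestamp")
  .option(TBL_NAME.key(), newTableName)
  .option("path", newBaseStreamingPath)                 // new table location
  .option("checkpointLocation", newCheckpointLocation)  // new checkpoint too
  .trigger(ProcessingTime(10000))
  .outputMode("append")
  .start()
```

The same idea applies if you keep the original path: delete (or move) the existing table directory so Hudi creates the table from scratch with the new RecordKey and KeyGenerator.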