Spark Streaming + Hudi: HoodieException: Config conflict (key | current value | existing value): RecordKey
When I use Spark to connect to a Kafka topic, create a DataFrame, and then write it to Hudi:
df
.selectExpr("key", "topic", "partition", "offset", "timestamp", "timestampType", "CAST(key AS STRING)", "CAST(value AS STRING)")
.writeStream
.format("hudi")
.options(getQuickstartWriteConfigs)
.option(PRECOMBINE_FIELD.key(), "essDateTime")
.option("hoodie.datasource.write.keygenerator.class","org.apache.hudi.keygen.ComplexKeyGenerator")
.option(RECORDKEY_FIELD.key(), "offset,timestamp") // previously "offset,essDateTime"
.option(TBL_NAME.key, streamingTableName)
.option("path", baseStreamingPath)
.trigger(ProcessingTime(10000))
.outputMode("append")
.option("checkpointLocation", checkpointLocation)
.start()
I get the following exception:
[ERROR] 2023-01-31 09:35:25.474 [stream execution thread for [id = 8b30fd4b-8506-490b-80ad-76868c14594f, runId = 25d34e6f-10e2-42c2-b094-654797f5d79c]] HoodieStreamingSink - Micro batch id=1 threw following exception:
org.apache.hudi.exception.HoodieException: Config conflict(key current value existing value):
RecordKey: offset,timestamp uuid
KeyGenerator: org.apache.hudi.keygen.ComplexKeyGenerator org.apache.hudi.keygen.SimpleKeyGenerator
at org.apache.hudi.HoodieWriterUtils$.validateTableConfig(HoodieWriterUtils.scala:167) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieSparkSqlWriter$.write(HoodieSparkSqlWriter.scala:90) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$2(HoodieStreamingSink.scala:129) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at scala.util.Try$.apply(Try.scala:213) ~[scala-library-2.12.15.jar:?]
at org.apache.hudi.HoodieStreamingSink.$anonfun$addBatch$1(HoodieStreamingSink.scala:128) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.retry(HoodieStreamingSink.scala:214) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.hudi.HoodieStreamingSink.addBatch(HoodieStreamingSink.scala:127) ~[hudi-spark3-bundle_2.12-0.12.2.jar:0.12.2]
at org.apache.spark.sql.execution.streaming.MicroBatchExecution.$anonfun$runBatch$17(MicroBatchExecution.scala:666) ~[spark-sql_2.12-3.3.1.jar:3.3.1]
The goal is to store all of the Kafka data in a Hudi table.
In Apache Hudi, some table configurations cannot be overridden once the table exists, and the KeyGenerator is one of them. It looks like the table was originally created with org.apache.hudi.keygen.SimpleKeyGenerator (and record key uuid, per the error message), so you need to recreate the table in order to change this configuration and the record key. If you want a quick test, you can change baseStreamingPath and write the data into a new Hudi table.
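A minimal sketch of that quick test, reusing the streaming query from the question but pointing it at a fresh table path (the path values and newTableName here are hypothetical; a new checkpointLocation is also used, since reusing the old checkpoint would resume the old query state):

```scala
// Assumption: df, getQuickstartWriteConfigs, PRECOMBINE_FIELD, RECORDKEY_FIELD,
// and TBL_NAME are already in scope as in the question's snippet.
import org.apache.spark.sql.streaming.Trigger.ProcessingTime

// Hypothetical fresh locations: an empty path means no existing
// hoodie.properties, so no config-conflict validation can fail.
val newBaseStreamingPath  = "/tmp/hudi/kafka_events_v2"
val newCheckpointLocation = "/tmp/checkpoints/kafka_events_v2"
val newTableName          = "kafka_events_v2"

df
  .selectExpr("offset", "timestamp", "essDateTime",
              "CAST(key AS STRING) AS key_str", "CAST(value AS STRING) AS value_str")
  .writeStream
  .format("hudi")
  .options(getQuickstartWriteConfigs)
  .option(PRECOMBINE_FIELD.key(), "essDateTime")
  .option("hoodie.datasource.write.keygenerator.class",
          "org.apache.hudi.keygen.ComplexKeyGenerator") // takes effect: new table
  .option(RECORDKEY_FIELD.key(), "offset,timestamp")
  .option(TBL_NAME.key(), newTableName)
  .option("path", newBaseStreamingPath)                 // new table location
  .option("checkpointLocation", newCheckpointLocation)  // new checkpoint too
  .trigger(ProcessingTime(10000))
  .outputMode("append")
  .start()
```

The same idea applies if you keep the original path: delete (or move) the existing table directory so Hudi creates the table from scratch with the new RecordKey and KeyGenerator.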