Unable to save partitioned data in Iceberg format when using S3 and Glue

I am getting the following error -

 java.lang.IllegalStateException: Incoming records violate the writer assumption that records are clustered by spec and by partition within each spec. Either cluster the incoming records or switch to fanout writers.
Encountered records that belong to already closed files:
partition 'year=2022/month=10/day=8/hour=12' in spec [
  1000: year: identity(24)
  1001: month: identity(25)
  1002: day: identity(26)
  1003: hour: identity(27)
]
        at org.apache.iceberg.io.ClusteredWriter.write(ClusteredWriter.java:96)
        at org.apache.iceberg.io.ClusteredDataWriter.write(ClusteredDataWriter.java:31)
        at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:758)
        at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:728)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:442)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

This is the query I am running on Spark 3.3, with a Glue catalog, saving to S3. The Iceberg version is 1.1.0 -

CREATE TABLE my_catalog.test.iceberg_test
USING iceberg
PARTITIONED BY (year, month, day, hour)
AS SELECT * from data

But when I try to save the data without partitioning, it works without any problem -

CREATE TABLE my_catalog.test.iceberg_test
USING iceberg
AS SELECT * from data

How do I fix this?

According to the documentation, the data needs to be sorted before it is written -

Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) prior to writing against a partitioned table. This applies both to writing with SQL and writing with DataFrames.
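
The same requirement can also be met purely in SQL: a global ORDER BY on the partition columns forces a range shuffle, so each task receives records already sorted by partition. A minimal sketch of that variant, reusing the table and view names from the question (the DataFrame route below is what I ultimately used) -

# Sketch: satisfy the clustering requirement inside the CTAS itself
# by globally sorting on the partition columns.
spark.sql("""
    CREATE TABLE my_catalog.test.iceberg_test
    USING iceberg
    PARTITIONED BY (year, month, day, hour)
    AS SELECT * from data ORDER BY year, month, day, hour
""")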

So this is how I fixed the issue -

df = spark.read.orc("s3a://...")
# Sort within each Spark task by the partition columns so records
# arrive at the writer clustered by partition.
df = df.sortWithinPartitions("year", "month", "day", "hour")
df.createOrReplaceTempView("data")

Then I run the partitioned SQL query without any issues.
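
Alternatively, the error message itself points to fanout writers, which keep one file open per partition and therefore do not require clustered input, at the cost of higher memory use during the write. A minimal sketch, assuming the write.spark.fanout.enabled table property is set at creation time -

# Sketch: enable Iceberg's fanout writer so the write no longer
# requires records to be clustered by partition.
spark.sql("""
    CREATE TABLE my_catalog.test.iceberg_test
    USING iceberg
    PARTITIONED BY (year, month, day, hour)
    TBLPROPERTIES ('write.spark.fanout.enabled' = 'true')
    AS SELECT * from data
""")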
