Unable to save partitioned data in Iceberg format when using S3 and Glue

I am getting the following error -

 java.lang.IllegalStateException: Incoming records violate the writer assumption that records are clustered by spec and by partition within each spec. Either cluster the incoming records or switch to fanout writers.
Encountered records that belong to already closed files:
partition 'year=2022/month=10/day=8/hour=12' in spec [
  1000: year: identity(24)
  1001: month: identity(25)
  1002: day: identity(26)
  1003: hour: identity(27)
]
        at org.apache.iceberg.io.ClusteredWriter.write(ClusteredWriter.java:96)
        at org.apache.iceberg.io.ClusteredDataWriter.write(ClusteredDataWriter.java:31)
        at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:758)
        at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:728)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:442)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

This is the query I am running on Spark 3.3, with a Glue catalog, saving to S3. The Iceberg version is 1.1.0 -

CREATE TABLE my_catalog.test.iceberg_test
USING iceberg
PARTITIONED BY (year, month, day, hour)
AS SELECT * from data

But when I try to save the data without partitioning, it works without any problem -

CREATE TABLE my_catalog.test.iceberg_test
USING iceberg
AS SELECT * from data

How do I fix this?

According to the documentation, the data needs to be sorted before it is written -

Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) prior to writing against a partitioned table. This applies both to writing with SQL and writing with DataFrames.
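
The same requirement can also be met purely in SQL: a global ORDER BY on the partition columns forces a range shuffle, so each task receives records already sorted by partition. A minimal sketch of that variant, reusing the table and view names from the question (the DataFrame route below is what I ultimately used) -

# Sketch: satisfy the clustering requirement inside the CTAS itself
# by globally sorting on the partition columns.
spark.sql("""
    CREATE TABLE my_catalog.test.iceberg_test
    USING iceberg
    PARTITIONED BY (year, month, day, hour)
    AS SELECT * from data ORDER BY year, month, day, hour
""")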

So this is how I fixed the issue -

df = spark.read.orc("s3a://...")
# Sort within each Spark task by the partition columns so records
# arrive at the writer clustered by partition.
df = df.sortWithinPartitions("year", "month", "day", "hour")
df.createOrReplaceTempView("data")

Then I run the partitioned SQL query without any issues.
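
Alternatively, the error message itself points to fanout writers, which keep one file open per partition and therefore do not require clustered input, at the cost of higher memory use during the write. A minimal sketch, assuming the write.spark.fanout.enabled table property is set at creation time -

# Sketch: enable Iceberg's fanout writer so the write no longer
# requires records to be clustered by partition.
spark.sql("""
    CREATE TABLE my_catalog.test.iceberg_test
    USING iceberg
    PARTITIONED BY (year, month, day, hour)
    TBLPROPERTIES ('write.spark.fanout.enabled' = 'true')
    AS SELECT * from data
""")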
