
Unable to save partitioned data in Iceberg format when using S3 and Glue

Getting the following error -

 java.lang.IllegalStateException: Incoming records violate the writer assumption that records are clustered by spec and by partition within each spec. Either cluster the incoming records or switch to fanout writers.
Encountered records that belong to already closed files:
partition 'year=2022/month=10/day=8/hour=12' in spec [
  1000: year: identity(24)
  1001: month: identity(25)
  1002: day: identity(26)
  1003: hour: identity(27)
]
        at org.apache.iceberg.io.ClusteredWriter.write(ClusteredWriter.java:96)
        at org.apache.iceberg.io.ClusteredDataWriter.write(ClusteredDataWriter.java:31)
        at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:758)
        at org.apache.iceberg.spark.source.SparkWrite$PartitionedDataWriter.write(SparkWrite.java:728)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.$anonfun$run$1(WriteToDataSourceV2Exec.scala:442)
        at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1538)
        at org.apache.spark.sql.execution.datasources.v2.DataWritingSparkTask$.run(WriteToDataSourceV2Exec.scala:480)
        at org.apache.spark.sql.execution.datasources.v2.V2TableWriteExec.$anonfun$writeWithV2$2(WriteToDataSourceV2Exec.scala:381)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:136)
        at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:548)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1504)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:551)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)

This is the query I am running on Spark 3.3, with the Glue catalog, saving to S3. The Iceberg version is 1.1.0 -

CREATE TABLE my_catalog.test.iceberg_test
USING iceberg
PARTITIONED BY (year, month, day, hour)
AS SELECT * from data

But when I try to save the data without partitioning, it works without any problems -

CREATE TABLE my_catalog.test.iceberg_test
USING iceberg
AS SELECT * from data

How do I fix this?

According to the docs, the data needs to be sorted before saving it -

Iceberg requires the data to be sorted according to the partition spec per task (Spark partition) in prior to write against partitioned table. This applies both Writing with SQL and Writing with DataFrames.

So this is how I fixed the issue -

df = spark.read.orc("s3a://...")
# sort each task's rows by the partition columns so the Iceberg writer
# sees records clustered by partition
df = df.sortWithinPartitions("year", "month", "day", "hour")
df.createOrReplaceTempView("data")

and then ran the partitioned SQL query without any problem.
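
The error message also offers a second way out: switching to fanout writers, which keep an open data file per partition so the incoming records do not have to be clustered. Below is a minimal sketch of that alternative, assuming Iceberg's fanout-enabled Spark write option (write.spark.fanout.enabled is the corresponding table property); the table and column names are the ones from the query above.

from pyspark.sql.functions import col

df = spark.read.orc("s3a://...")  # same unsorted source data as above

# Create the partitioned Iceberg table with the fanout writer enabled,
# so records do not need to arrive clustered by partition.
(df.writeTo("my_catalog.test.iceberg_test")
    .using("iceberg")
    .partitionedBy(col("year"), col("month"), col("day"), col("hour"))
    .option("fanout-enabled", "true")  # assumed Iceberg Spark write option
    .create())

Fanout writing trades executor memory for flexibility (one open file per partition per task), so sorting with sortWithinPartitions as above is usually the cheaper fix when there are many partitions.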
