
AWS Glue load new partitions from ETL job fails

I'm trying to use an ETL job to write my dataframe directly to a database catalog and update my partitions.

I had code like this:

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = dropnullfields3, 
    connection_type = "s3", 
    connection_options = {
       "path": TARGET_PATH, 
       "partitionKeys":["x", "y"]
    },
    format = "parquet", 
    transformation_ctx = "datasink4")

additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["x", "y"]


sink = glueContext.write_dynamic_frame_from_catalog(frame=dropnullfields3, 
  database=DATABASE, 
  table_name=TABLE, 
  transformation_ctx="write_sink", 
  additional_options=additionalOptions)

which worked to write the data into the catalog. However, I would like to avoid the double write, so I followed method 2 from the documentation to update partitions: https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html

And came up with this code:

datasink4 = glueContext.write_dynamic_frame.from_options(
    frame = dropnullfields3, 
    connection_type = "s3", 
    connection_options = {
       "path": TARGET_PATH, 
       "partitionKeys":["x", "y"]
    },
    format = "parquet", 
    transformation_ctx = "datasink4")

sink = glueContext.getSink(connection_type="s3", path=TARGET_PATH,
                           enableUpdateCatalog=True,
                           partitionKeys=["x", "y"])
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=DATABASE, catalogTableName=TABLE)
sink.writeFrame(dropnullfields3)

But now the data can't be loaded in Athena; I get strange errors about the data structure, like this:

HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: < expected at the end of 'struct' (Service: null; Status Code: 0; Error Code: null; Request ID: null)

I have tried recreating the table so that it contains only the new glueparquet files.

I have also tried running a crawler on the new glueparquet files, and the table generated by the crawler can be queried. However, when I fill the same table from the ETL job above, I always get this error...
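(For reference, re-running such a crawler can be scripted; a minimal boto3 sketch, assuming a crawler named my-glueparquet-crawler already exists and points at the new files:)

import time
import boto3

glue = boto3.client("glue")
CRAWLER_NAME = "my-glueparquet-crawler"  # hypothetical crawler name

# Start the crawler over the glueparquet output and wait until it is READY again.
glue.start_crawler(Name=CRAWLER_NAME)
while glue.get_crawler(Name=CRAWLER_NAME)["Crawler"]["State"] != "READY":
    time.sleep(30)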

You want to change the classification of the table to glueparquet:

CREATE EXTERNAL TABLE `table_name`(
 ...
)
PARTITIONED BY ( 
  ...
)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
  's3://cortisol-beta-log-bucket/service_log/'
TBLPROPERTIES (
  'classification'='glueparquet')
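If the table already exists and you would rather not recreate it, the classification can also be updated in place through the Glue API. A minimal boto3 sketch, assuming placeholder database/table names (get_table returns read-only fields that have to be stripped before they can be passed back to update_table):

import boto3

glue = boto3.client("glue")

# Placeholder names -- substitute your own database and table.
DATABASE = "my_database"
TABLE = "table_name"

# Fetch the current definition of the table.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]

# update_table takes a TableInput, so drop the read-only fields
# that get_table returns alongside the editable ones.
read_only = {"DatabaseName", "CreateTime", "UpdateTime", "CreatedBy",
             "IsRegisteredWithLakeFormation", "CatalogId", "VersionId"}
table_input = {k: v for k, v in table.items() if k not in read_only}

# Set the classification that the glueparquet sink expects.
table_input.setdefault("Parameters", {})["classification"] = "glueparquet"

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)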

Or, in CDK, you need to set the dataFormat as follows:

dataFormat: new DataFormat({
    inputFormat: InputFormat.PARQUET,
    // Have to explicitly specify the classification string to allow Glue jobs to add partitions
    classificationString: new ClassificationString("glueparquet"),
    outputFormat: OutputFormat.PARQUET,
    serializationLibrary: SerializationLibrary.PARQUET
}),

Then you can just use the code below and it will work with Athena:

glueContext.write_dynamic_frame.from_catalog(
    frame=last_transform,
    database=args["GLUE_DATABASE"],
    table_name=args["GLUE_TABLE"],
    transformation_ctx="datasink",
    additional_options={"partitionKeys": partition_keys, "enableUpdateCatalog": True},
)
