I'm trying to use an ETL job to write my dataframe directly to a database catalog and update its partitions. I had code like this:
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={
        "path": TARGET_PATH,
        "partitionKeys": ["x", "y"]
    },
    format="parquet",
    transformation_ctx="datasink4")
additionalOptions = {"enableUpdateCatalog": True}
additionalOptions["partitionKeys"] = ["x", "y"]
sink = glueContext.write_dynamic_frame_from_catalog(
    frame=dropnullfields3,
    database=DATABASE,
    table_name=TABLE,
    transformation_ctx="write_sink",
    additional_options=additionalOptions)
This worked to write the data into the catalog, but I would like to avoid the double write. So I followed method 2 from the documentation on updating partitions (https://docs.aws.amazon.com/glue/latest/dg/update-from-job.html) and came up with this code:
datasink4 = glueContext.write_dynamic_frame.from_options(
    frame=dropnullfields3,
    connection_type="s3",
    connection_options={
        "path": TARGET_PATH,
        "partitionKeys": ["x", "y"]
    },
    format="parquet",
    transformation_ctx="datasink4")
sink = glueContext.getSink(connection_type="s3", path=TARGET_PATH,
                           enableUpdateCatalog=True,
                           partitionKeys=["x", "y"])
sink.setFormat("glueparquet")
sink.setCatalogInfo(catalogDatabase=DATABASE, catalogTableName=TABLE)
sink.writeFrame(dropnullfields3)
But now the data can't be loaded in Athena; I get strange errors about the data structure, like this:
HIVE_METASTORE_ERROR: com.facebook.presto.spi.PrestoException: Error: < expected at the end of 'struct' (Service: null; Status Code: 0; Error Code: null; Request ID: null)
I have tried recreating the table so that it contains only the new glueparquet files.
I have also tried running a crawler on the new glueparquet files; the table generated by the crawler can be queried. However, when I fill that same table from the ETL job above, I always get this error...
You need to change the table's classification to glueparquet so that it matches the format your job writes:
CREATE EXTERNAL TABLE `table_name`(
...
)
PARTITIONED BY (
...
)
ROW FORMAT SERDE
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
LOCATION
's3://cortisol-beta-log-bucket/service_log/'
TBLPROPERTIES (
'classification'='glueparquet')
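If the table already exists, you don't have to drop and recreate it; the classification can also be patched in place through the Glue API. A minimal sketch with boto3, assuming DATABASE and TABLE are the same names used in the job above:
import boto3

glue = boto3.client("glue")

# Read the current table definition, flip the classification, and
# write it back. update_table accepts only a subset of the fields
# that get_table returns, so copy over just the allowed keys.
table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)["Table"]
table.setdefault("Parameters", {})["classification"] = "glueparquet"

allowed = {"Name", "Description", "Owner", "Retention", "StorageDescriptor",
           "PartitionKeys", "TableType", "Parameters"}
glue.update_table(
    DatabaseName=DATABASE,
    TableInput={k: v for k, v in table.items() if k in allowed})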
or in CDK you need to set the dataFormat as follows:
dataFormat: new DataFormat({
  inputFormat: InputFormat.PARQUET,
  // Have to explicitly specify the classification string to allow Glue jobs to add partitions
  classificationString: new ClassificationString("glueparquet"),
  outputFormat: OutputFormat.PARQUET,
  serializationLibrary: SerializationLibrary.PARQUET
}),
Then you can just use the code below and it will work with Athena:
glueContext.write_dynamic_frame.from_catalog(
    frame=last_transform,
    database=args["GLUE_DATABASE"],
    table_name=args["GLUE_TABLE"],
    transformation_ctx="datasink",
    additional_options={"partitionKeys": partition_keys, "enableUpdateCatalog": True},
)
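For context, here is roughly where that call sits in a full job script. This is a sketch: the GLUE_DATABASE/GLUE_TABLE job arguments and the partition_keys list are placeholders you supply yourself, and last_transform stands in for whatever DynamicFrame your job produces.
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions

args = getResolvedOptions(sys.argv, ["JOB_NAME", "GLUE_DATABASE", "GLUE_TABLE"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

partition_keys = ["x", "y"]
# ... build last_transform from your source data, then run the
# from_catalog write shown above ...

job.commit()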