
Can AWS Glue crawl Delta Lake table data?

According to the article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure whether it is also possible to do this outside of the Databricks platform. Has someone done that? Also, is it possible to add Delta Lake related metadata using Glue crawlers?

This is not possible. You can crawl the S3 Delta files outside of the Databricks platform, but you won't find the data in the resulting tables.

As per the doc, it says the following:

Warning

Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.

I am currently using a solution to generate manifests of Delta tables using Apache Spark (https://docs.delta.io/latest/presto-integration.html#language-python).

I generate a manifest file for each Delta Table using:

from delta.tables import DeltaTable

deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")  # spark: the active SparkSession
deltaTable.generate("symlink_format_manifest")
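The same Delta docs page also describes a table property (delta.compatibility.symlinkFormatManifest.enabled) that makes Delta regenerate the manifest automatically on every write, so you don't have to re-run generate() after each update.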

Then I created the table using the DDL below. This also creates the table inside the Glue Catalog, so you can then access the data from AWS Glue via the Glue Data Catalog.

CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/'  -- location of the generated manifest
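Note that, per the same Delta docs, if the table is partitioned you also need to run MSCK REPAIR TABLE mytable after creating it so the partitions are picked up.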

It would help if you could clarify what you mean by "integrate Delta Lake with AWS Glue".

At this moment there is no direct Glue API for Delta Lake support; however, you could write customized code using the Delta Lake library to save the output as a Delta Lake table, for example as sketched below.
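As a minimal sketch of that idea (mine, not an official Glue example), assuming the Delta Lake JARs are available to the Glue job (e.g. via the job's extra JARs setting), and with placeholder S3 paths:

from pyspark.sql import SparkSession

# Spark session configured for Delta Lake, using the two settings from the Delta docs
spark = (
    SparkSession.builder
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Hypothetical input: any DataFrame the Glue job produces
df = spark.read.parquet("s3://my-bucket/raw/")

# Save the job output in Delta format instead of plain Parquet
df.write.format("delta").mode("overwrite").save("s3://my-bucket/delta/my_table/")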

To use a Crawler to add Delta Lake metadata to the Catalog, here is a workaround. The workaround is not pretty and has two major parts.

1) Get the manifest of files referenced by the Delta Lake table. You could refer to the Delta Lake source code, play with the logs in _delta_log, or use a brute-force method such as

import org.apache.spark.sql.functions.input_file_name

// List the distinct data files referenced by the current version of the table
spark.read.format("delta")
  .load("<path-to-delta-lake>")
  .select(input_file_name())
  .distinct()

2) Use the Scala or Python Glue API and that manifest to create or update the table in the Catalog.
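As a rough sketch of that second step (not the original code, just an illustration with boto3), where the database, table name, schema and S3 paths are placeholders, and the location points at the symlink manifest folder as in the earlier answer:

import boto3

glue = boto3.client("glue")

# Register the table in the Glue Data Catalog
glue.create_table(
    DatabaseName="my_database",            # placeholder database
    TableInput={
        "Name": "my_delta_table",          # placeholder table name
        "TableType": "EXTERNAL_TABLE",
        "StorageDescriptor": {
            "Columns": [                   # schema of your Delta table
                {"Name": "id", "Type": "bigint"},
                {"Name": "value", "Type": "string"},
            ],
            # Pointing at the generated symlink manifest keeps engines on current files only
            "Location": "s3://my-bucket/delta/my_table/_symlink_format_manifest/",
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary":
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    },
)

For later refreshes you can call glue.update_table() with the same TableInput structure.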

It is finally possible to use AWS Glue Crawlers to detect and catalog Delta Tables.

Here is a blog post explaining how to do it.

AWS Glue Crawler allows us to update metadata from Delta table transaction logs to the Glue metastore. Ref - https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-delta-lake (a sketch of such a crawler configuration follows the list below).

But there are a few downsides to it -

  • It creates a symlink table in the Glue metastore
  • This symlink-based approach wouldn't work well in the case of multiple versions of the table, since the manifest file would point to the latest version
  • There is no identifier in the Glue metadata to tell whether a given table is a Delta table, in case you have different types of tables in your metastore
  • Any execution engine that accesses the Delta table via manifest files won't be able to use other auxiliary data in the transaction logs, such as column stats
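For illustration, a crawler with the Delta targets described in the linked configuration page can be created with boto3 roughly like this (my own sketch; the names, IAM role and S3 path are placeholders):

import boto3

glue = boto3.client("glue")

# Crawler that targets a Delta Lake table directly
glue.create_crawler(
    Name="my-delta-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",   # placeholder role
    DatabaseName="my_database",
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://my-bucket/delta/my_table/"],
                "WriteManifest": True,            # symlink-manifest behaviour discussed above
                # "CreateNativeDeltaTable": True, # newer option to catalog a native Delta table
            }
        ]
    },
)

glue.start_crawler(Name="my-delta-crawler")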

Yes, it is possible, but only recently.

See the AWS Blog entry below for details on this just-announced capability.

https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/
