Can AWS Glue crawl Delta Lake table data?

According to an article by Databricks, it is possible to integrate Delta Lake with AWS Glue. However, I am not sure whether this is also possible outside the Databricks platform. Has anyone done that? Also, is it possible to add Delta Lake-related metadata using Glue crawlers?

This is not possible. You can crawl the Delta files on S3 outside the Databricks platform, but you won't find the data in the resulting tables.

As the documentation says:

Warning

Do not use AWS Glue Crawler on the location to define the table in AWS Glue. Delta Lake maintains files corresponding to multiple versions of the table, and querying all the files crawled by Glue will generate incorrect results.
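
To see why querying every crawled file gives wrong results, here is a minimal sketch that replays a simplified Delta transaction log. It assumes the commit entries have already been parsed from the JSON files in _delta_log into Python dicts carrying `add` and `remove` actions; the file names and the `live_files` helper are made up for illustration.

```python
from typing import Dict, Iterable, Set


def live_files(actions: Iterable[Dict]) -> Set[str]:
    """Replay add/remove actions in commit order and return the files
    that belong to the latest table version."""
    current: Set[str] = set()
    for action in actions:
        if "add" in action:
            current.add(action["add"]["path"])
        elif "remove" in action:
            current.discard(action["remove"]["path"])
    return current


# Hypothetical log: version 0 writes two files, version 1 rewrites one of them.
log = [
    {"add": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0001.parquet"}},
    {"remove": {"path": "part-0000.parquet"}},
    {"add": {"path": "part-0002.parquet"}},
]

# Until VACUUM runs, all three files still sit in S3, so a crawler sees them all.
files_on_s3 = {"part-0000.parquet", "part-0001.parquet", "part-0002.parquet"}
current = live_files(log)
stale = files_on_s3 - current

print(sorted(current))  # ['part-0001.parquet', 'part-0002.parquet']
print(sorted(stale))    # ['part-0000.parquet']
```

A table defined over the whole location would read the stale file as well, so rows rewritten in version 1 would show up twice.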

I am currently using a solution that generates manifests of Delta tables with Apache Spark ( https://docs.delta.io/latest/presto-integration.html#language-python ).

I generate a manifest file for each Delta table using:

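from delta.tables import DeltaTable
# `spark` is an existing SparkSession; forPath takes it as its first argument
deltaTable = DeltaTable.forPath(spark, "<path-to-delta-table>")
deltaTable.generate("symlink_format_manifest")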

Then I created the table using the DDL below. The DDL also creates the table in the Glue Data Catalog, so you can then access the data from AWS Glue through the Data Catalog.

CREATE EXTERNAL TABLE mytable ([(col_name1 col_datatype1, ...)])
[PARTITIONED BY (col_name2 col_datatype2, ...)]
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '<path-to-delta-table>/_symlink_format_manifest/'  -- location of the generated manifest

It would help if you could clarify what you mean by "integrate Delta Lake with AWS Glue".

At the moment, there is no direct Glue API for Delta Lake support. However, you could write custom code using the Delta Lake library to save output as a Delta Lake table.

To use a crawler to add Delta Lake metadata to the Catalog, here is a workaround. It is not pretty and has two major parts.

1) Get the manifest of files referenced by the Delta Lake table. You could refer to the Delta Lake source code, play with the logs in _delta_log, or use a brute-force method such as:

import org.apache.spark.sql.functions.input_file_name

spark.read.format("delta")
  .load("<path-to-delta-lake>")
  .select(input_file_name())
  .distinct()

2) Use the Scala or Python Glue API and the manifest to create or update the table in the Catalog.
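
Step 2 could be sketched in Python with boto3 roughly as follows. This is only an illustration under assumptions: it supposes the file list from step 1 has been written out in symlink manifest format under `_symlink_format_manifest/`, and the `build_table_input` and `create_in_glue` helpers, the bucket path, and the column names are all hypothetical. The `create_table` call itself needs AWS credentials, so it sits in a function that is not invoked here.

```python
from typing import Dict, List, Optional, Tuple


def build_table_input(
    name: str,
    manifest_location: str,
    columns: List[Tuple[str, str]],
    partition_keys: Optional[List[Tuple[str, str]]] = None,
) -> Dict:
    """Build a Glue TableInput for a table that reads Parquet files
    through a symlink manifest."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [
            {"Name": n, "Type": t} for n, t in (partition_keys or [])
        ],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": t} for n, t in columns],
            "Location": manifest_location,
            "InputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": (
                    "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
                ),
            },
        },
    }


def create_in_glue(database: str, table_input: Dict) -> None:
    """Register the table in the Glue Data Catalog (requires AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    glue = boto3.client("glue")
    glue.create_table(DatabaseName=database, TableInput=table_input)


# Placeholder values, not taken from the original post.
table_input = build_table_input(
    name="mytable",
    manifest_location="s3://my-bucket/path/to/table/_symlink_format_manifest/",
    columns=[("id", "bigint"), ("value", "string")],
    partition_keys=[("dt", "string")],
)
```

Calling `create_in_glue("mydb", table_input)` would then register the table; to refresh an existing table, boto3 also offers `update_table` with the same `TableInput` shape.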

It is finally possible to use AWS Glue crawlers to detect and catalog Delta tables.

Here is a blog post explaining how to do it.

AWS Glue Crawler allows us to update metadata from Delta table transaction logs to the Glue metastore. Ref: https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-delta-lake

But there are a few downsides to it:

  • It creates a symlink table in the Glue metastore.
  • This symlink-based approach doesn't work well when the table has multiple versions, since the manifest file points only to the latest version.
  • There is no identifier in the Glue metadata to tell whether a given table is a Delta table, which matters if your metastore holds different types of tables.
  • Any execution engine that accesses a Delta table via manifest files won't use other auxiliary data in the transaction log, such as column statistics.
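
For reference, a crawler with a Delta target might be defined through boto3 along these lines. Treat it as a sketch under assumptions rather than a definitive setup: the crawler name, role ARN, database, and S3 path are placeholders, the `build_delta_crawler_config` and `create_crawler` helpers are hypothetical, and the call to AWS is wrapped in a function that is not run here because it needs credentials.

```python
from typing import Dict, List


def build_delta_crawler_config(
    name: str,
    role_arn: str,
    database: str,
    delta_table_paths: List[str],
    write_manifest: bool = True,
) -> Dict:
    """Build keyword arguments for glue.create_crawler with a Delta target."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {
            "DeltaTargets": [
                {
                    "DeltaTables": list(delta_table_paths),
                    # Ask the crawler to write symlink manifest files.
                    "WriteManifest": write_manifest,
                }
            ]
        },
    }


def create_crawler(config: Dict) -> None:
    """Create the crawler in AWS Glue (requires AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    boto3.client("glue").create_crawler(**config)


# Placeholder values, not taken from the original post.
config = build_delta_crawler_config(
    name="delta-crawler",
    role_arn="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    database="mydb",
    delta_table_paths=["s3://my-bucket/path/to/table/"],
)
```

With `WriteManifest` enabled, the resulting catalog entry is the symlink table described in the first bullet above.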

Yes, it is possible, but only since recently.

See the AWS blog entry below for details on this just-announced capability.

https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/
