
Reading a Delta Table with no Manifest File using Redshift

My goal is to read a Delta Table on AWS S3 using Redshift. I've read through the Redshift Spectrum to Delta Lake Integration documentation and noticed that it mentions generating a manifest with Apache Spark, either using:

GENERATE symlink_format_manifest FOR TABLE delta.`<path-to-delta-table>`

or

DeltaTable deltaTable = DeltaTable.forPath(<path-to-delta-table>);
deltaTable.generate("symlink_format_manifest");

However, there doesn't seem to be support for generating these manifest files in Apache Flink or the Delta Standalone library it uses, which is the underlying software that writes data to the Delta Table.

How can I get around this limitation?
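For context, running the documented Spark command as a small standalone job, separate from the Flink writer, would look roughly like the sketch below (a PySpark sketch; the S3 path and session config are placeholders):

from pyspark.sql import SparkSession
from delta.tables import DeltaTable

# Standalone job whose only purpose is to regenerate the symlink manifest for a
# Delta table written by another engine (e.g. Flink); run it on a schedule or
# after each Flink commit. Delta Lake must be on the classpath, e.g. via
# --packages io.delta:delta-core_2.12:<version>.
spark = (
    SparkSession.builder
    .appName("regenerate-delta-manifest")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

delta_table = DeltaTable.forPath(spark, "s3://my-bucket/path/to/delta-table")  # placeholder path
delta_table.generate("symlink_format_manifest")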

This functionality now seems to be supported on AWS:

With today's launch, Glue crawler is adding support for creating AWS Glue Data Catalog tables for native Delta Lake tables and does not require generating manifest files. This improves customer experience because now you don't have to regenerate manifest files whenever a new partition becomes available or a table's metadata changes.

https://aws.amazon.com/blogs/big-data/introducing-native-delta-lake-table-support-with-aws-glue-crawlers/
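If you go this route, the crawler can be created through the console or the API; here is a rough boto3 sketch, where the crawler name, IAM role, Glue database, region, and S3 path are all placeholders:

import boto3

# Create a Glue crawler with a Delta Lake target so the table is registered in
# the Glue Data Catalog as a native Delta table, without symlink manifests.
glue = boto3.client("glue", region_name="us-east-1")  # placeholder region

glue.create_crawler(
    Name="delta-table-crawler",                               # placeholder name
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",    # placeholder role
    DatabaseName="delta_db",                                  # target Glue database
    Targets={
        "DeltaTargets": [
            {
                "DeltaTables": ["s3://my-bucket/path/to/delta-table/"],  # placeholder path
                "WriteManifest": False,           # no symlink manifests written by the crawler
                "CreateNativeDeltaTable": True,   # register as a native Delta Lake table
            }
        ]
    },
)

glue.start_crawler(Name="delta-table-crawler")

Once the crawler has run, the table shows up in the Glue Data Catalog and can be reached from Redshift through an external schema created from the data catalog; if your setup still requires symlink manifests, setting WriteManifest to True tells the crawler to write them for you.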
