
Spark SQL queries against Delta Lake Tables using Symlink Format Manifest

I'm running Spark 3.1.1 on an AWS emr-6.3.0 cluster with the following Hive/metastore configurations:

spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
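
For reference, a minimal PySpark sketch of how these settings might be supplied when building the SparkSession (this assumes the Delta Lake jars are already on the cluster classpath, e.g. via EMR or `--packages`; the app name is a placeholder):

```python
# Minimal sketch: same configuration keys as above, set programmatically.
# Assumes delta-core is available on the classpath; names are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("delta-glue-example")  # hypothetical app name
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .config("spark.delta.logStore.class",
            "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
    .config("hive.metastore.client.factory.class",
            "com.amazonaws.glue.catalog.metastore."
            "AWSGlueDataCatalogHiveClientFactory")
    .enableHiveSupport()  # so Glue-backed databases/tables are visible to SQL
    .getOrCreate()
)
```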

I have a Delta Lake table defined in AWS Glue that points to an S3 location, specifically to a symlink manifest file (for Redshift Spectrum integration):

location=s3://<bucket>/<database>/<table>/_symlink_format_manifest
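
For context, this kind of manifest directory is typically produced with Delta Lake's manifest generation API. A sketch, assuming the `delta.tables` Python bindings are available on the cluster and using the same placeholder path:

```python
# Generate the _symlink_format_manifest directory for Redshift Spectrum.
# The S3 path below is a placeholder matching the table location above.
from delta.tables import DeltaTable

table_path = "s3://<bucket>/<database>/<table>/"
DeltaTable.forPath(spark, table_path).generate("symlink_format_manifest")
```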

When I run Spark SQL queries against the table (either from a Spark application or spark-shell), e.g. "select * from database.table limit 10", I get the following exception:

Caused by: java.lang.RuntimeException: s3://<bucket>/<database>/<table>/_symlink_format_manifest/manifest is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]

It seems that Spark SQL / Hive is not set up to read the manifest file that the external table defined in Glue points to for Redshift integration; the manifest is just a plain-text list of Parquet file paths (the tail bytes [117, 101, 116, 10] are the ASCII for "uet\n", the end of a ".parquet" path), so Spark fails when it tries to parse it as a Parquet file. I'm wondering if anyone else has encountered / troubleshot this problem, and if there's a way to configure Spark or Hive to correctly read the Delta Lake Parquet files without changing the external table definition in Glue (which points to a manifest).

I ended up figuring this out myself.

You can save yourself a lot of pain and misunderstanding by grasping the distinction between querying a Delta Lake external table (via Glue) and querying a Delta Lake table directly; see: https://docs.delta.io/latest/delta-batch.html#read-a-table

In order to query the Delta Lake table directly, without having to interact with or go through the external table, simply change the table reference in your Spark SQL query to the following format:

delta.`<table-path>`

For example,

spark.sql("""select * from delta.`s3://<bucket>/<key>/<table-name>/` limit 10""")

