Spark SQL queries against Delta Lake Tables using Symlink Format Manifest
I'm running Spark 3.1.1 on an AWS emr-6.3.0 cluster with the following Hive/metastore configurations:
spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension
hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
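For context, settings like these are typically supplied via spark-defaults.conf or as `--conf` flags at launch time. A minimal sketch of starting spark-shell with the same values (on EMR 6.3.0 the Delta and Glue jars are assumed to already be on the classpath):

```shell
# Sketch: passing the same configuration via --conf when launching spark-shell.
# Assumes the Delta Lake and AWS Glue client jars are already available on the cluster.
spark-shell \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```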
I have a Delta Lake table defined in AWS Glue that points to an S3 location, specifically to a symlink manifest file (for Redshift Spectrum integration):
location=s3://<bucket>/<database>/<table>/_symlink_format_manifest
When I run Spark SQL queries against the table (either from a Spark application or spark-shell), e.g. "select * from database.table limit 10", I get the following exception:
Caused by: java.lang.RuntimeException: s3://<bucket>/<database>/<table>/_symlink_format_manifest/manifest is not a Parquet file. expected magic number at tail [80, 65, 82, 49] but found [117, 101, 116, 10]
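The error message itself tells the story: Parquet files end with the 4-byte magic `PAR1` ([80, 65, 82, 49]), while the bytes actually found, [117, 101, 116, 10], decode to `uet\n`, i.e. the tail of a plain-text manifest whose last line ends in `.parquet`. A small standalone sketch (the helper is hypothetical, not part of Spark):

```python
def looks_like_parquet(data: bytes) -> bool:
    # Parquet files begin and end with the 4-byte magic b"PAR1".
    return len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

# The tail bytes from the error message decode to plain text, not Parquet magic.
tail = bytes([117, 101, 116, 10])
print(tail)  # b'uet\n' -- the end of a ".parquet" path plus a newline

# A symlink manifest is just newline-separated S3 paths, so it fails the check.
manifest_line = b"s3://bucket/database/table/part-00000.parquet\n"
print(looks_like_parquet(manifest_line))  # False
```

So Spark is handed the manifest text file and tries to parse it as Parquet, which is exactly what the exception reports.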
It seems that Spark SQL / Hive is not set up to read the manifest file that the external table defined in Glue points to for Redshift integration. I'm wondering if anyone else has encountered or troubleshot this problem, and whether there's a way to configure Spark or Hive to correctly read the Delta Lake Parquet files without changing the external table definition in Glue (which points to a manifest).
I ended up figuring this out myself.
You can save yourself a lot of pain and misunderstanding by grasping the distinction between querying a Delta Lake external table (via Glue) and querying a Delta Lake table directly; see: https://docs.delta.io/latest/delta-batch.html#read-a-table
To query the Delta Lake table directly, without going through the external table at all, simply change the table reference in your Spark SQL query to the following format:
delta.`<table-path>`
For example,
spark.sql("""select * from delta.`s3://<bucket>/<key>/<table-name>/` limit 10""")
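If you build such query strings programmatically, a tiny helper keeps the backtick syntax in one place (this helper and the bucket/key names are illustrative, not part of any Spark API):

```python
def delta_table_ref(path: str) -> str:
    # Build a Spark SQL reference that reads the Delta table at `path`
    # directly, bypassing the Glue external table. Spark SQL delimits
    # the path with backticks, so a backtick inside the path would break it.
    if "`" in path:
        raise ValueError("path must not contain backticks")
    return f"delta.`{path}`"

query = f"select * from {delta_table_ref('s3://my-bucket/my-key/my-table/')} limit 10"
print(query)  # select * from delta.`s3://my-bucket/my-key/my-table/` limit 10
# Then run it with: spark.sql(query)
```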