Connecting a Glue Catalog with Delta tables to the Databricks SQL engine

Question

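I am trying to query Delta tables from my AWS Glue Catalog on the Databricks SQL engine. The tables are stored in Delta Lake format, and Glue crawlers maintain their schemas automatically. The catalog is set up and works with non-Delta tables. With this setup Databricks loads the available tables for each database through the catalog, but queries against the Delta tables fail because Databricks reads them with the hive format instead of delta.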

Incompatible format detected.

A transaction log for Databricks Delta was found at `s3://COMPANY/club/attachment/_delta_log`,
but you are trying to read from `s3://COMPANY/club/attachment` using format("hive"). You must use
'format("delta")' when reading and writing to a delta table.

To disable this check, SET spark.databricks.delta.formatCheck.enabled=false
To learn more about Delta, see https://docs.databricks.com/delta/index.html

SQL Warehouse settings => Data Access Configuration

spark.databricks.hive.metastore.glueCatalog.enabled : true

The crawler, configured with AWS's DELTA LAKE setup, generates the following table metadata

{
    "StorageDescriptor": {
        "cols": {
            "FieldSchema": [
                {
                    "name": "id",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "media",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "media_type",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "title",
                    "type": "string",
                    "comment": ""
                },
                {
                    "name": "type",
                    "type": "smallint",
                    "comment": ""
                },
                {
                    "name": "clubmessage_id",
                    "type": "string",
                    "comment": ""
                }
            ]
        },
        "location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
        "inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
        "outputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
        "compressed": "false",
        "numBuckets": "-1",
        "SerDeInfo": {
            "name": "",
            "serializationLib": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
            "parameters": {}
        },
        "bucketCols": [],
        "sortCols": [],
        "parameters": {
            "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
            "CrawlerSchemaSerializerVersion": "1.0",
            "CrawlerSchemaDeserializerVersion": "1.0",
            "classification": "parquet"
        },
        "SkewedInfo": {},
        "storedAsSubDirectories": "false"
    },
    "parameters": {
        "UPDATED_BY_CRAWLER": "CRAWLER_NAME",
        "CrawlerSchemaSerializerVersion": "1.0",
        "CrawlerSchemaDeserializerVersion": "1.0",
        "classification": "parquet"
    }
}
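The metadata above registers the table at the `_symlink_format_manifest` directory with `SymlinkTextInputFormat`, not at the Delta table root. As a minimal sketch, assuming the crawler output has been loaded into a Python dict with the same key names as shown above (the function name `is_symlink_manifest_table` is hypothetical), tables registered this way could be flagged like this:

```python
SYMLINK_INPUT_FORMAT = "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat"
MANIFEST_DIR = "_symlink_format_manifest"


def is_symlink_manifest_table(table: dict) -> bool:
    """Return True if the table metadata points at a Delta symlink manifest."""
    sd = table.get("StorageDescriptor", {})
    location = sd.get("location", "").rstrip("/")
    return (
        sd.get("inputFormat") == SYMLINK_INPUT_FORMAT
        and location.endswith("/" + MANIFEST_DIR)
    )


# Trimmed copy of the crawler output from the question.
example_table = {
    "StorageDescriptor": {
        "location": "s3://COMPANY/club/attachment/_symlink_format_manifest",
        "inputFormat": "org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat",
    }
}

print(is_symlink_manifest_table(example_table))
```

The check only inspects the two fields that matter here; the key casing follows the JSON in the question and may differ in other exports of the same metadata.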

Answer 1

I am facing the same problem. It seems you cannot use Spark SQL to query Delta tables in Glue, because the setting

spark.databricks.hive.metastore.glueCatalog.enabled : true

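means that the table will be treated as a hive table. You will need to access the table directly in S3, which loses the advantages of the metadata catalog.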
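A rough sketch of what reading the table directly from S3 could look like, using the `format("delta")` reader named in the error message. It assumes a `spark` SparkSession with Delta Lake support is already available (for example on a cluster or notebook); the helper names are hypothetical:

```python
def read_delta_by_path(spark, path: str):
    """Load a Delta table from its root directory, bypassing the catalog.

    `spark` is assumed to be an existing SparkSession with Delta Lake support.
    """
    return spark.read.format("delta").load(path)


def delta_path_query(path: str) -> str:
    """SQL form of the same idea: address the table as delta.`<path>`."""
    return f"SELECT * FROM delta.`{path}`"


# The path is the table root from the error message, not the manifest folder.
print(delta_path_query("s3://COMPANY/club/attachment"))
```

Either form names the storage path explicitly, so the database and table names from Glue are not used.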

However, you can read it by blocking your cluster's access to the _delta_log folder with the following IAM policy:

{
    "Sid": "BlockDeltaLog",
    "Effect": "Deny",
    "Action": "s3:*",
    "Resource": [
        "arn:aws:s3:::BUCKET"
    ],
    "Condition": {
        "StringLike": {
            "s3:prefix": [
                "_delta_log/"
            ]
        }
    }
}

Answer 2

I was able to query a Delta table created by a Glue crawler after updating its location. In your case it needs to be changed from s3://COMPANY/club/attachment/_symlink_format_manifest to s3://COMPANY/club/attachment

This is because Delta on Spark does not look at _symlink_format_manifest the way Hive and Presto do. It only needs to know the root directory.

The command to update the location in Databricks looks like this:

ALTER TABLE my_db.my_table
SET LOCATION "s3://COMPANY/club/attachment"

Note: the database's location must also be set for this command to work.
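If several crawler-created tables need the same fix, the statement can be generated from the registered location. This is only a sketch under stated assumptions: the helpers `delta_root` and `alter_location_sql` are hypothetical, the table root is taken to be the parent of `_symlink_format_manifest`, and the output still has to be run in Databricks by hand.

```python
MANIFEST_DIR = "_symlink_format_manifest"


def delta_root(location: str) -> str:
    """Strip a trailing _symlink_format_manifest segment, if present."""
    trimmed = location.rstrip("/")
    suffix = "/" + MANIFEST_DIR
    if trimmed.endswith(suffix):
        return trimmed[: -len(suffix)]
    return trimmed


def alter_location_sql(database: str, table: str, location: str) -> str:
    """Build the ALTER TABLE ... SET LOCATION statement for one table."""
    return f'ALTER TABLE {database}.{table}\nSET LOCATION "{delta_root(location)}"'


statement = alter_location_sql(
    "my_db",
    "my_table",
    "s3://COMPANY/club/attachment/_symlink_format_manifest",
)
print(statement)
```

Locations that do not end in the manifest folder are returned unchanged apart from a trailing slash, so the helper is safe to apply to tables that are already registered at their root.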

Answer 1
0 2022-08-12 09:29:29

Answer 2
0 2022-09-20 20:06:11
