Spectrum Scan Error: Column 0 uses dictionary encoding but the dictionary page is empty

I just started using AWS Glue 4.0 to generate Glue catalog tables and store the data as Parquet files on S3. I use Redshift Spectrum to create an external table, which lets me read this S3 data directly from Redshift.

This worked with Glue 3.0, but since upgrading to Glue 4.0 I get the following error (edited to hide the S3 path):

Error: Spectrum Scan Error.
Code: 15007
Context: Parquet file 
Parquet file 'https://s3.us-east-1.amazonaws.com/<<bucket>>/<<path>>/created_date%3D2021-12-22/run-1670477101661-part-block-0-0-r-00000-snappy.parquet': 
metadata is corrupt. Column 0 uses dictionary encoding but the dictionary page is empty. 
(s3://<<bucket>>/<<path>>/created_date=2021-12-22/run-1670477101661-part-block-0-0-r-00000-snappy.parquet)
query: 42738706
location: dory_util.cpp:1445
process: worker_thread [pid=8836]

I can query the data in Athena, but not in Redshift. I can also read and query the Parquet file when I load it in a local Spark session.

I tried generating the Parquet files using two approaches, neither of which worked:

  1. glue_context.getSink, using the glueparquet format;
  2. glue_context.write_dynamic_frame_from_catalog, using the parquet format and setting useGlueParquetWriter to true.

The external schema in Redshift was created like this:

create external schema if not exists my_ext_database from data catalog database 'my_ext_database'
    iam_role 'arn:aws:iam::123456789:role/my-role-name';

I was expecting to be able to query the external schema from Redshift. Why can't Redshift Spectrum read the data?

I solved it by using pyspark to write the Parquet files to S3, basically something like df.write.parquet('s3://<bucket>/<path>/'). Now I can access the data in Redshift using an external schema (via Redshift Spectrum).
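
As a rough illustration of that workaround, here is a minimal sketch of a helper that writes a Spark DataFrame as partitioned Parquet. The function name, the `mode` default of "append" and the choice to partition by `created_date` (taken from the path in the error message) are assumptions made for the example, not part of the original job.

```python
def write_partitioned_parquet(df, s3_path, partition_cols=("created_date",), mode="append"):
    """Write a Spark DataFrame as Parquet with the plain Spark writer.

    `df` is expected to be a pyspark.sql.DataFrame. In a Glue job it could be
    obtained from a DynamicFrame with `dynamic_frame.toDF()`.
    The defaults for `partition_cols` and `mode` are illustrative assumptions.
    """
    writer = df.write.mode(mode)
    if partition_cols:
        # Produces Hive-style directories such as created_date=2021-12-22/
        writer = writer.partitionBy(*partition_cols)
    writer.parquet(s3_path)


# Hypothetical usage inside a Glue 4.0 job:
#   df = dynamic_frame.toDF()
#   write_partitioned_parquet(df, "s3://<bucket>/<path>/")
```

Because this bypasses the Glue sink, nothing updates the Glue catalog automatically; the table and its partitions have to be registered separately.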

This also means that I had to create a Glue catalog table pointing to the path, and that I need to maintain the partitions in that Glue catalog table myself. I used boto3 to automate both steps, but it was a fair bit of work to develop.
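
The catalog bookkeeping could look roughly like the sketch below. It is not the code I actually used: the helper names, the single-partition-key layout and the example column list are assumptions. The boto3 calls are `create_table` and `batch_create_partition`, and the client is passed in so the payload builders can be checked without AWS access.

```python
PARQUET_SERDE = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
PARQUET_INPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
PARQUET_OUTPUT = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"


def storage_descriptor(columns, location):
    """Build a Glue StorageDescriptor for Parquet data at `location`."""
    return {
        "Columns": [{"Name": name, "Type": type_} for name, type_ in columns],
        "Location": location,
        "InputFormat": PARQUET_INPUT,
        "OutputFormat": PARQUET_OUTPUT,
        "SerdeInfo": {"SerializationLibrary": PARQUET_SERDE},
    }


def table_input(name, columns, partition_keys, location):
    """Build the TableInput payload for glue.create_table."""
    return {
        "Name": name,
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": n, "Type": t} for n, t in partition_keys],
        "StorageDescriptor": storage_descriptor(columns, location),
    }


def partition_input(columns, table_location, key, value):
    """Build one PartitionInput for a Hive-style `key=value/` directory."""
    location = f"{table_location.rstrip('/')}/{key}={value}/"
    return {"Values": [value], "StorageDescriptor": storage_descriptor(columns, location)}


def register(glue, database, name, columns, partition_key, location, values):
    """Create the table, then add its partitions in batches of 100.

    `glue` is a boto3 Glue client, e.g. boto3.client("glue").
    Assumes a single partition key given as a (name, type) tuple.
    """
    glue.create_table(
        DatabaseName=database,
        TableInput=table_input(name, columns, [partition_key], location),
    )
    parts = [partition_input(columns, location, partition_key[0], v) for v in values]
    # batch_create_partition accepts at most 100 partitions per request.
    for start in range(0, len(parts), 100):
        glue.batch_create_partition(
            DatabaseName=database,
            TableName=name,
            PartitionInputList=parts[start:start + 100],
        )
```

A call such as `register(boto3.client("glue"), "my_ext_database", "events", [("id", "bigint")], ("created_date", "string"), "s3://<bucket>/<path>/", ["2021-12-22"])` would then be enough for a new table; the table name and columns here are placeholders.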

So basically my solution is to avoid using getSink or write_dynamic_frame_from_catalog with Glue 4.0 to produce the Parquet data, because these methods produce Parquet files which, at the time of writing, cannot be read by Redshift Spectrum.

Even with the workaround, a couple of (non-breaking) drawbacks remain:

  1. Redshift introspection fails in PyCharm. This is fixed in PyCharm 2022.3.1 RC.
  2. Clicking on a Glue table to view its schema in the AWS console yields the error "This table no longer exists".
    • One workaround is to open the table in a new browser tab (using Chrome); the full schema is then displayed.
    • Another workaround is to trick AWS Glue by setting the classification parameter to glueparquet in the partition storage descriptor (i.e. {'StorageDescriptor': {'Parameters': {'classification': 'glueparquet'}}}).
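
For that second workaround, a sketch along these lines might do, assuming the partition already exists in the catalog. The helper names are hypothetical; `get_partition` and `update_partition` are the boto3 Glue client calls. The existing partition is copied first so that only the classification parameter changes, and fields that `update_partition` does not accept in a PartitionInput (such as CreationTime) are dropped.

```python
from copy import deepcopy

# Keys of a Partition (as returned by get_partition) that are also valid in a PartitionInput.
PARTITION_INPUT_KEYS = ("Values", "LastAccessTime", "StorageDescriptor", "Parameters", "LastAnalyzedTime")


def with_classification(partition, classification="glueparquet"):
    """Return a PartitionInput copied from `partition` with the storage
    descriptor's classification parameter set. The input is not mutated."""
    result = {k: deepcopy(v) for k, v in partition.items() if k in PARTITION_INPUT_KEYS}
    descriptor = result.setdefault("StorageDescriptor", {})
    descriptor.setdefault("Parameters", {})["classification"] = classification
    return result


def patch_partition(glue, database, table, values):
    """Fetch one partition and write it back with classification=glueparquet.

    `glue` is a boto3 Glue client; `values` is the list of partition values,
    e.g. ["2021-12-22"].
    """
    current = glue.get_partition(
        DatabaseName=database, TableName=table, PartitionValues=values
    )["Partition"]
    glue.update_partition(
        DatabaseName=database,
        TableName=table,
        PartitionValueList=values,
        PartitionInput=with_classification(current),
    )
```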
