Folders excluded in a Glue crawler throw a HIVE_BAD_DATA error in Athena

Question

I'm trying to create a Glue crawler to crawl a specific path pattern. I have the following paths:

bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet

The same pattern is repeated every day, i.e. we have the above for

bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*

I only want to crawl what's in the **/predictions folders each day. I've set up a Glue crawler pointing to bucket/inference/, and have the following exclude patterns:

**/modelling/**
**/extract/**

The logs correctly show that the bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet files are being excluded, and the DDL metadata shows that it's picking up the correct number of objects and rows in the data.

However, when I run SELECT * in Athena, I get the following error:

HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1

I've tried every combination of the above exclude patterns, but Athena always seems to pick up what's in the modelling folder, despite the crawler logs explicitly excluding it. Am I missing something here?

Many thanks.

Answer 1

This is a known issue with Athena. From the AWS troubleshooting documentation:

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)
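
As a rough illustration of "place the files in a different location", here is a minimal sketch that moves the modelling and extract objects out from under the crawled prefix. The destination prefix inference_excluded/ and the helper names are my own assumptions, not anything from the question or from AWS; it also assumes the <yyyy>/<mm>/<dd>/<folder>/<file> layout shown in the question. Only standard boto3 S3 calls are used (list_objects_v2 paginator, copy_object, delete_object), and boto3 is imported lazily so the key-mapping function can be tried without AWS access.

```python
from typing import Optional

# Hypothetical names chosen for illustration; adjust to your own layout.
SOURCE_PREFIX = "inference/"
EXCLUDED_PREFIX = "inference_excluded/"
EXCLUDED_FOLDERS = {"modelling", "extract"}


def relocated_key(key: str) -> Optional[str]:
    """Return the new key for an object Athena should not read, else None."""
    if not key.startswith(SOURCE_PREFIX):
        return None
    parts = key[len(SOURCE_PREFIX):].split("/")
    # Expected layout: <yyyy>/<mm>/<dd>/<folder>/<file...>
    if len(parts) >= 5 and parts[3] in EXCLUDED_FOLDERS:
        return EXCLUDED_PREFIX + "/".join(parts)
    return None


def move_excluded(bucket: str) -> None:
    """Copy each excluded object to the new prefix, then delete the original."""
    import boto3  # imported here so relocated_key works without boto3 installed

    s3 = boto3.client("s3")
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=SOURCE_PREFIX):
        for obj in page.get("Contents", []):
            new_key = relocated_key(obj["Key"])
            if new_key is None:
                continue  # prediction files stay where the table points
            s3.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": obj["Key"]},
            )
            s3.delete_object(Bucket=bucket, Key=obj["Key"])
```

After a move like this, only the prediction files remain under bucket/inference/, so Athena has nothing non-Parquet to trip over there. Whatever writes the modelling and extract outputs would also need to target the new prefix going forward, otherwise the error comes back the next day.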

Answer 1: accepted, 2022-05-05 13:56:43
