
Excluded folder in Glue crawler throws HIVE_BAD_DATA error in Athena

I'm trying to create a Glue crawler to crawl a specific path pattern. I have the following paths:

bucket/inference/2022/04/28/modelling/metadata.tar.gz
bucket/inference/2022/04/28/prediction/predictions.parquet
bucket/inference/2022/04/28/extract/data.parquet

The same pattern is repeated every day, i.e. we have the above for:

bucket/inference/2022/04/29/*
bucket/inference/2022/04/30/*

I only want to crawl what's in the **/predictions folders each day. I've set up a Glue crawler pointing to bucket/inference/, and have the following exclude patterns:

**/modelling/**
**/extract/**

The logs correctly show that the bucket/inference/2022/04/28/modelling/metadata.tar.gz and bucket/inference/2022/04/28/extract/data.parquet files are being excluded, and the DDL metadata shows that it's picking up the correct number of objects and rows in the data.

However, when I run SELECT * in Athena, I get the following error:

HIVE_BAD_DATA: Not valid Parquet file: s3://bucket/inference/2022/04/28/modelling/metadata.tar.gz expected magic number: PAR1
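
For context on the "expected magic number: PAR1" part of the message: an unencrypted Parquet file starts and ends with the 4 bytes PAR1, and a .tar.gz archive does not. The following is only a minimal sketch of that check, not Athena's actual reader; the function name and the sample byte strings are made up for illustration.

```python
PARQUET_MAGIC = b"PAR1"


def looks_like_parquet(data: bytes) -> bool:
    """Return True if the bytes start and end with the Parquet magic number.

    Hypothetical helper for illustration only. It checks the framing bytes,
    not whether the footer metadata in between is valid.
    """
    # Smallest possible layout: magic (4) + footer length (4) + magic (4).
    return (
        len(data) >= 12
        and data[:4] == PARQUET_MAGIC
        and data[-4:] == PARQUET_MAGIC
    )


# A gzip stream begins with the bytes 0x1f 0x8b, so it fails the check.
gzip_like = b"\x1f\x8b\x08\x00" + b"\x00" * 16
# Fake payload framed by the magic number, standing in for a Parquet file.
parquet_like = PARQUET_MAGIC + b"\x00" * 16 + PARQUET_MAGIC

print(looks_like_parquet(gzip_like))
print(looks_like_parquet(parquet_like))
```

So the error indicates that the query engine opened metadata.tar.gz as if it were a Parquet data file belonging to the table.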

I've tried every combination of the above exclude patterns, but Athena always seems to pick up what's in the modelling folder, despite the crawler logs explicitly excluding it. Am I missing something here?

Many thanks.

This is a known issue with Athena. From the AWS troubleshooting documentation:

Athena does not recognize exclude patterns that you specify for an AWS Glue crawler. For example, if you have an Amazon S3 bucket that contains both .csv and .json files and you exclude the .json files from the crawler, Athena queries both groups of files. To avoid this, place the files that you want to exclude in a different location.

Reference: Athena reads files that I excluded from the AWS Glue crawler (AWS)
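
One way to follow the "different location" advice is to move everything that is not in a prediction folder out from under the crawled prefix. The sketch below is an assumption-laden illustration, not an official AWS recipe: the destination prefix inference-other/ and both function names are hypothetical, and the object keys are assumed to be the paths from the question without the bucket name. The move step expects a boto3 S3 client to be passed in and uses its copy_object and delete_object calls.

```python
def relocated_key(
    key: str,
    keep_folder: str = "prediction",
    source_prefix: str = "inference/",
    other_prefix: str = "inference-other/",  # hypothetical destination prefix
) -> str:
    """Return the key an object should live at after the reorganisation.

    Objects inside a `keep_folder` directory stay where they are; every
    other object under `source_prefix` is mapped to `other_prefix`.
    """
    if not key.startswith(source_prefix):
        return key
    remainder = key[len(source_prefix):]
    folders = remainder.split("/")[:-1]  # drop the file name
    if keep_folder in folders:
        return key
    return other_prefix + remainder


def move_objects(s3, bucket: str, keys: list[str]) -> None:
    """Copy each misplaced object to its new key, then delete the original.

    `s3` is assumed to be a boto3 S3 client, e.g. boto3.client("s3").
    """
    for key in keys:
        new_key = relocated_key(key)
        if new_key != key:
            s3.copy_object(
                Bucket=bucket,
                Key=new_key,
                CopySource={"Bucket": bucket, "Key": key},
            )
            s3.delete_object(Bucket=bucket, Key=key)


print(relocated_key("inference/2022/04/28/modelling/metadata.tar.gz"))
print(relocated_key("inference/2022/04/28/prediction/predictions.parquet"))
```

With the other folders moved away, the table location under bucket/inference/ would contain only the prediction files, so the exclude patterns are no longer needed for Athena to read the table.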

