
How to deal with failing Athena queries as AWS Glue Data Catalog metadata size grows large?

Based on my research, the easiest and most straightforward way to get metadata out of Glue's Data Catalog is to use Athena and query the information_schema database. The following article, written by Amazon's team, came up frequently in my research:

Querying AWS Glue Data Catalog

However, under the section titled "Considerations and limitations", the following is written:

Querying information_schema is most performant if you have a small to moderate amount of AWS Glue metadata. If you have a large amount of metadata, errors can occur.

Unfortunately, the article gives no indication of what constitutes a "large amount of metadata" or exactly which errors can occur when the metadata is large and one needs to query it. My question is: how do you deal with the ever-growing size of the Data Catalog's metadata so that you never encounter errors when using Athena to query it? Is there a best practice for this? Or is there a better way to get the same metadata that querying the catalog through Athena provides, without making a great many API calls (using boto3, Hive DDL, etc.)?
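For concreteness, the information_schema queries in question look like the following (a sketch; `my_database` is a hypothetical schema name):

```python
# Example Athena queries against the Glue Data Catalog via information_schema.
# Run these in the Athena console or through the Athena API.
# 'my_database' is a hypothetical schema name.

LIST_TABLES_SQL = """
SELECT table_schema, table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'my_database'
"""

LIST_COLUMNS_SQL = """
SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'my_database'
"""
```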

I talked to AWS Support and did some research on this. Here's what I gathered:

  • The information_schema is built at query execution time; there does not seem to be any caching.
  • If you access information_schema.tables, Athena makes a separate call to the Hive Metastore (Glue Data Catalog) for each schema you have.
  • If you access information_schema.columns, Athena makes a separate call to the Hive Metastore for each schema and each table in that schema.
  • These queries are subject to the general Athena service quotas. In particular, DML queries such as a SELECT must finish within 30 minutes.

If your Glue Data Catalog has many thousands of schemas, tables, and columns, all of this can result in slow performance. As a rough guesstimate, Support told me that you should be fine as long as you have fewer than roughly 10,000 tables, which should be the case for most people.
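Given that information_schema fans out into one metastore call per schema (and per table for columns), one way to sidestep it entirely is to enumerate the catalog yourself with boto3's Glue paginators. A minimal sketch, assuming default AWS credentials are configured; `get_databases` and `get_tables` are real Glue API operations:

```python
def list_all_tables(glue_client):
    """Yield (database, table) name pairs for every table in the catalog,
    paginating through all databases and their tables."""
    db_paginator = glue_client.get_paginator("get_databases")
    tbl_paginator = glue_client.get_paginator("get_tables")
    for db_page in db_paginator.paginate():
        for db in db_page["DatabaseList"]:
            for tbl_page in tbl_paginator.paginate(DatabaseName=db["Name"]):
                for tbl in tbl_page["TableList"]:
                    yield db["Name"], tbl["Name"]

if __name__ == "__main__":
    import boto3  # requires AWS credentials with glue:GetDatabases/GetTables
    glue = boto3.client("glue")
    for database, table in list_all_tables(glue):
        print(f"{database}.{table}")
```

This keeps you within the Glue API quotas rather than Athena's 30-minute DML limit, and lets you cache the result yourself, since unlike information_schema nothing here is rebuilt per query.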

