
AWS Glue Crawler sends all data to Glue Catalog and Athena without Glue Job

I am new to AWS Glue. I am using an AWS Glue Crawler to crawl data from two S3 buckets, with one file in each bucket. The crawler creates two tables in the AWS Glue Data Catalog, and I am also able to query the data in AWS Athena.

My understanding was that in order to get data into Athena I needed to create a Glue job that would pull the data into Athena, but I was wrong. Is it correct to say that the Glue crawler makes data queryable in Athena without a Glue job, and that a Glue job is only needed if we want to push the data into a database such as SQL Server or Oracle?

How can I configure the Glue Crawler so that it fetches only the delta data, rather than all of the data, from the source bucket every time?

Any help is appreciated.

The Glue crawler is only used to identify the schema that your data is in. Your data sits somewhere (e.g. S3) and the crawler identifies the schema by going through a percentage of your files.
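For reference, a crawler can also be created and run programmatically. Below is a minimal sketch using boto3; the bucket path, database name and IAM role ARN are placeholders, and the RecrawlPolicy setting is the built-in option closest to the incremental ("delta") behaviour asked about above.

import boto3

glue = boto3.client("glue")

# Create a crawler that catalogs one S3 location into a Glue database.
# The role, database and path below are placeholders for illustration.
glue.create_crawler(
    Name="my-s3-crawler",
    Role="arn:aws:iam::123456789012:role/MyGlueCrawlerRole",
    DatabaseName="my_catalog_db",
    Targets={"S3Targets": [{"Path": "s3://my-source-bucket/data/"}]},
    # Only crawl folders added since the last run instead of re-crawling
    # the whole bucket on every run.
    RecrawlPolicy={"RecrawlBehavior": "CRAWL_NEW_FOLDERS_ONLY"},
)

# Start the crawler; it infers the schema and creates/updates the table.
glue.start_crawler(Name="my-s3-crawler")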

You can then use a query engine like Athena (a managed, serverless Apache Presto) to query the data, since it already has a schema.
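As an illustration of that querying step, here is a hedged sketch of running a query against a crawled table with boto3; the database, table and result location names are assumptions.

import boto3

athena = boto3.client("athena")

# Run a SQL query against the table the crawler created.
# Database, table and result bucket are placeholder names.
resp = athena.start_query_execution(
    QueryString="SELECT * FROM my_crawled_table LIMIT 10",
    QueryExecutionContext={"Database": "my_catalog_db"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results-bucket/"},
)

# Poll get_query_execution / get_query_results with this id for the output.
print(resp["QueryExecutionId"])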

If you want to process / clean / aggregate the data, you can use Glue Jobs, which is basically managed, serverless Spark.
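To tie this back to the question about pushing data into a database such as SQL Server or Oracle, a Glue job script would look roughly like the following sketch; the catalog database, table and Glue connection names are assumptions for illustration only.

import sys
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the table that the crawler registered in the Data Catalog.
frame = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="my_crawled_table",
)

# Example transform: drop rows with a null "id" column (placeholder logic).
cleaned = frame.filter(lambda row: row["id"] is not None)

# Write the result to a relational database through a pre-configured
# Glue connection ("my-jdbc-connection" is a placeholder name).
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=cleaned,
    catalog_connection="my-jdbc-connection",
    connection_options={"dbtable": "target_table", "database": "target_db"},
)

job.commit()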
