
Glue crawler creating multiple tables

Asked 2022-10-05 12:28:25. Tags: amazon-web-services, aws-glue, aws-glue-data-catalog

I have 2 S3 buckets with the following format:

s3://bucket/{lob_name_1}/{table_name}/{current_date}/table_name.csv

s3://bucket/{lob_name_2}/{table_name}/{current_date}/table_name.csv

The same table names exist under 2 different LOBs, and we have one AWS Glue crawler per LOB. When the crawler runs for the first LOB, the tables are created as expected. When the crawler runs for the second LOB, the tables that LOB 1 and LOB 2 have in common are created again under a different name. Is there a way to prevent the additional table from being created when the crawler for the second LOB runs?
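To make the overlap concrete, here is a minimal sketch that splits object keys of the layout above into LOB, table name, and date, then lists the table names that appear under more than one LOB. It does not call AWS. The helper names (`parse_key`, `tables_shared_across_lobs`) and the sample keys are hypothetical, and the sketch assumes each table is identified by its `{table_name}` folder.

```python
from collections import defaultdict
from typing import NamedTuple


class KeyParts(NamedTuple):
    lob: str
    table: str
    date: str
    filename: str


def parse_key(key: str) -> KeyParts:
    """Split '{lob}/{table}/{date}/{file}' into its four components."""
    parts = key.strip("/").split("/")
    if len(parts) != 4:
        raise ValueError(f"unexpected key layout: {key!r}")
    return KeyParts(*parts)


def tables_shared_across_lobs(keys):
    """Return {table_name: sorted LOB names} for tables seen under more than one LOB."""
    lobs_by_table = defaultdict(set)
    for key in keys:
        parts = parse_key(key)
        lobs_by_table[parts.table].add(parts.lob)
    return {
        table: sorted(lobs)
        for table, lobs in lobs_by_table.items()
        if len(lobs) > 1
    }


# Hypothetical sample keys following the layout in the question.
sample_keys = [
    "lob_a/orders/2022-10-05/orders.csv",
    "lob_b/orders/2022-10-05/orders.csv",
    "lob_a/customers/2022-10-05/customers.csv",
]
shared = tables_shared_across_lobs(sample_keys)
print(shared)  # {'orders': ['lob_a', 'lob_b']}
```

Any table name this reports is one that both crawlers would try to create in the same database.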

1 Answer

There is a parameter you should be using that will fix your issue:

Create a single schema for each S3 path: true

Configuration options

Schema updates in the data store: Ignore the change and don't update the table in the data catalog.

Inherit schema from table: Update all new and existing partitions with metadata from the table.

Object deletion in the data store: Ignore the change and don't update the table in the data catalog.
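The console options above can also be set through the API. The following is a sketch, not a verified fix for this exact setup: it assumes the console labels map to the `Configuration` JSON and `SchemaChangePolicy` fields of boto3's `update_crawler` as shown in the comments, and the crawler name `lob_2_crawler` is hypothetical. Building the arguments is kept separate from the AWS call so the settings can be inspected without credentials.

```python
import json


def build_crawler_update(crawler_name: str) -> dict:
    """Build keyword arguments for glue.update_crawler mirroring the options above."""
    configuration = {
        "Version": 1.0,
        # Assumed equivalent of "Create a single schema for each S3 path".
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        # Assumed equivalent of "Inherit schema from table" for partitions.
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }
    return {
        "Name": crawler_name,
        # "LOG" records the change without modifying the Data Catalog table.
        "SchemaChangePolicy": {
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        },
        # update_crawler expects the configuration as a JSON string.
        "Configuration": json.dumps(configuration),
    }


def apply_crawler_update(crawler_name: str) -> None:
    """Send the settings to AWS Glue (requires boto3 and AWS credentials)."""
    import boto3  # imported here so the builder above works without boto3

    glue = boto3.client("glue")
    glue.update_crawler(**build_crawler_update(crawler_name))


# Inspect the request locally; call apply_crawler_update(...) to send it.
kwargs = build_crawler_update("lob_2_crawler")  # hypothetical crawler name
print(json.dumps(kwargs, indent=2))
```

Applying the same settings to both LOB crawlers keeps their behavior consistent. Whether this stops the duplicate table from appearing should be confirmed by rerunning the second crawler against a test database.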


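Notice: Technical posts on this site are licensed under CC BY-SA 4.0. If you republish this content, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.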
