
Glue crawler creating multiple tables

I have 2 S3 buckets with the following format:

  1. s3://bucket/{lob_name_1}/{table_name}/{current_date}/table_name.csv
  2. s3://bucket/{lob_name_2}/{table_name}/{current_date}/table_name.csv

The same table names exist under 2 different LOBs, and we have one AWS Glue crawler per LOB. When the crawler runs for the first LOB, the tables are created as expected. When the crawler runs for the second LOB, the tables that LOB 1 and LOB 2 have in common are recreated under a different name. Is there a way to prevent the additional tables from being created when the crawler for the second LOB runs?

There is a crawler parameter you should be using that will fix your issue:

Create a single schema for each S3 path: true

Configuration options

Schema updates in the data store: Ignore the change and don't update the table in the data catalog.

Inherit schema from table: Update all new and existing partitions with metadata from the table.

Object deletion in the data store: Ignore the change and don't update the table in the data catalog.
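
If you manage the crawler through the API instead of the console, the options above can be sketched with boto3 roughly as follows. This is a minimal sketch, not a verified fix for your setup: the crawler name `lob2-crawler` and the helper names are hypothetical, and the mapping of console labels to API fields (`CombineCompatibleSchemas`, `LOG`, `InheritFromTable`) reflects my reading of the Glue crawler configuration docs, so check it against your crawler before applying it.

```python
import json


def build_crawler_update(crawler_name: str) -> dict:
    """Build keyword arguments for glue.update_crawler (hypothetical helper)."""
    configuration = {
        "Version": 1.0,
        # Console: "Create a single schema for each S3 path"
        "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas"},
        # Console: "Update all new and existing partitions with metadata from the table"
        "CrawlerOutput": {
            "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
        },
    }
    return {
        "Name": crawler_name,
        # Console: "Ignore the change and don't update the table in the data catalog"
        # for both schema updates and object deletion.
        "SchemaChangePolicy": {
            "UpdateBehavior": "LOG",
            "DeleteBehavior": "LOG",
        },
        # The API expects the configuration as a JSON string.
        "Configuration": json.dumps(configuration),
    }


def apply_crawler_update(crawler_name: str) -> None:
    """Send the update to AWS Glue; needs boto3 and valid AWS credentials."""
    import boto3  # imported here so the builder works without boto3 installed

    glue = boto3.client("glue")
    glue.update_crawler(**build_crawler_update(crawler_name))


if __name__ == "__main__":
    # Assumed crawler name; prints the request without calling AWS.
    print(json.dumps(build_crawler_update("lob2-crawler"), indent=2))
```

Call `apply_crawler_update("lob2-crawler")` once the printed request looks right; the crawler has to be idle for the update to go through.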
