
Create tables in Glue Data Catalog for data in S3 and unknown schema

My current use case: in an ETL-based service (note: the ETL service does not use Glue ETL; it is an independent service), I am extracting data from AWS Redshift clusters into S3. The data in S3 is then fed into the T and L jobs. I want to populate the metadata into the Glue Data Catalog. The most basic solution for this is the Glue Crawler, but the crawler runs for approximately 1 hour and 20 minutes (there are a lot of S3 partitions). The other solution I came across is to use the Glue APIs. However, with that approach I am facing the issue of defining the data types.

Is there any way I can create/update Glue Catalog tables for data in S3 when the data types are known only during the extraction process?

Also, when the T and L jobs are run, the data types should be readily available in the catalog.

To create or update the Data Catalog during your ETL process, you can make use of the following:

Update:

# Enable catalog updates so the sink writes schema changes back to the Data Catalog
additionalOptions = {"enableUpdateCatalog": True, "updateBehavior": "UPDATE_IN_DATABASE"}
additionalOptions["partitionKeys"] = ["partition_key0", "partition_key1"]

# Write the transformed DynamicFrame to the existing catalog table,
# updating its schema and partitions in the process
sink = glueContext.write_dynamic_frame_from_catalog(frame=last_transform, database=<dst_db_name>,
                                                    table_name=<dst_tbl_name>, transformation_ctx="write_sink",
                                                    additional_options=additionalOptions)
job.commit()

The above can be used to update the schema. You can also set the updateBehavior to either LOG or UPDATE_IN_DATABASE (the default).

Create:

To create new tables in the Data Catalog during your ETL, you can follow this example:

# Configure an S3 sink that creates/updates the catalog entry on write
sink = glueContext.getSink(connection_type="s3", path="s3://path/to/data",
                           enableUpdateCatalog=True, updateBehavior="UPDATE_IN_DATABASE",
                           partitionKeys=["partition_key0", "partition_key1"])
sink.setFormat("<format>")  # e.g. "json", "csv", "avro"
sink.setCatalogInfo(catalogDatabase=<dst_db_name>, catalogTableName=<dst_tbl_name>)
sink.writeFrame(last_transform)

You can specify the database and the new table name using setCatalogInfo.

You also have the option to update the partitions in the Data Catalog by passing the enableUpdateCatalog argument and then specifying the partitionKeys.

A more detailed explanation of this functionality can be found here.

I found a solution to the problem: I ended up using the Glue Catalog APIs to make it seamless and fast. I created an interface that interacts with the Glue Catalog and overrode its methods for the various data sources. Right after the data has been loaded into S3, I fire a query to get the schema from the source, and then the interface does its work.
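A minimal sketch of this approach using boto3's Glue client (this is not the poster's actual interface; the type mapping, function names, and column list below are illustrative assumptions):

```python
# Map source (e.g. Redshift) types to Glue/Hive types; extend as needed.
TYPE_MAP = {"varchar": "string", "int4": "int", "int8": "bigint",
            "float8": "double", "timestamp": "timestamp", "bool": "boolean"}


def build_table_input(table_name, columns, s3_path, partition_keys=()):
    """Build the TableInput dict for glue.create_table / glue.update_table.

    `columns` is a list of (name, source_type) pairs obtained during the
    extraction step, when the schema is known."""
    def to_glue(t):
        return TYPE_MAP.get(t.lower(), "string")  # fall back to string

    return {
        "Name": table_name,
        "TableType": "EXTERNAL_TABLE",
        "PartitionKeys": [{"Name": k, "Type": "string"} for k in partition_keys],
        "StorageDescriptor": {
            "Columns": [{"Name": n, "Type": to_glue(t)} for n, t in columns],
            "Location": s3_path,
            # SerDe/format settings for CSV data; adjust for your file format
            "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                "Parameters": {"field.delim": ","},
            },
        },
    }


def upsert_table(database, table_input):
    """Create the table in the Glue Data Catalog, or update it if it exists."""
    import boto3  # requires AWS credentials with glue:CreateTable/UpdateTable
    glue = boto3.client("glue")
    try:
        glue.create_table(DatabaseName=database, TableInput=table_input)
    except glue.exceptions.AlreadyExistsException:
        glue.update_table(DatabaseName=database, TableInput=table_input)
```

Because the TableInput is built from the schema discovered at extraction time, the types are registered in the catalog immediately, so the downstream T and L jobs can read them without waiting for a crawler run.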
