Loading AWS Glue S3 Source Data

2021-02-11 09:12:58 | Tags: amazon-web-services, amazon-s3, aws-glue

I have a use case where AWS Glue is a good fit for data transformation.

However, the source file for this transformation job is retrieved via an HTTPS call that can take 45 minutes to return.

What is the best approach to load this data into S3 and then SFTP the Glue output once the job completes?

This job needs to run both on a schedule and on demand.

1 answer

I don't think Glue currently offers a way to load data directly from an HTTP/HTTPS source.

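You can create a Lambda function or an EC2 instance running a service that fetches the source file and puts it into an S3 bucket. Keep in mind that a Lambda function can run for at most 15 minutes, so for a call that can take 45 minutes to return, the EC2 option is the safer choice.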
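As a rough illustration of that extraction step, here is a minimal Python sketch that streams the HTTPS response straight into S3. It assumes boto3 is available on the host; the function name `fetch_to_s3`, the `s3_client` parameter, and the bucket, key, and URL values are all made up for this example and are not part of the original setup.

```python
import urllib.request


def fetch_to_s3(url, bucket, key, s3_client=None, timeout=3600):
    """Stream the response body of `url` into s3://bucket/key.

    `timeout` is the socket timeout in seconds. It is set to one hour here
    (an assumption) because the source call may take about 45 minutes
    before it starts returning data.
    """
    if s3_client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3
        s3_client = boto3.client("s3")

    with urllib.request.urlopen(url, timeout=timeout) as response:
        # upload_fileobj reads the file-like object in chunks, so the whole
        # file never has to fit in memory.
        s3_client.upload_fileobj(response, bucket, key)

    return f"s3://{bucket}/{key}"
```

On an EC2 instance this could be called from a cron entry for the scheduled case and run by hand for the on-demand case, for example `fetch_to_s3("https://example.com/export", "my-raw-bucket", "incoming/source.csv")` with your own URL and bucket.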

I would suggest the following:

1. Create an empty S3 bucket.
2. Create a Lambda function or an EC2 instance running a service that fetches the source file and loads it into that bucket.
3. Configure event notifications on the bucket so that an SNS topic is notified whenever a new file is PUT/uploaded.
4. Create a Lambda function subscribed to that SNS topic that triggers the Glue job containing your transformations.
5. Create a Glue job that loads the data from the S3 bucket you created, applies the transformations, and writes the result back to S3.
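The notification and trigger steps above could look roughly like the following Python sketch. It assumes boto3 and a Glue job that accepts `--source_bucket` and `--source_key` arguments. The job name `transform-job`, the argument names, and the helper names are hypothetical and only serve the illustration.

```python
import json
from urllib.parse import unquote_plus

# Hypothetical Glue job name; replace it with the name of your own job.
GLUE_JOB_NAME = "transform-job"


def build_notification_config(topic_arn):
    """Bucket notification config that publishes every new object to SNS."""
    return {
        "TopicConfigurations": [
            {"TopicArn": topic_arn, "Events": ["s3:ObjectCreated:*"]}
        ]
    }


def configure_bucket(bucket, topic_arn, s3_client=None):
    """Attach the notification config to the bucket (one-time setup)."""
    if s3_client is None:
        import boto3
        s3_client = boto3.client("s3")
    s3_client.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(topic_arn),
    )


def extract_s3_objects(sns_event):
    """Return (bucket, key) pairs from an SNS-wrapped S3 event."""
    objects = []
    for sns_record in sns_event.get("Records", []):
        s3_event = json.loads(sns_record["Sns"]["Message"])
        # S3 test events carry no "Records" key, so they are skipped here.
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 event notifications.
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def lambda_handler(event, context, glue_client=None):
    """Start one Glue job run per uploaded object."""
    if glue_client is None:
        import boto3
        glue_client = boto3.client("glue")
    run_ids = []
    for bucket, key in extract_s3_objects(event):
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return {"started": run_ids}
```

Inside the Glue job, arguments passed this way can be read with `getResolvedOptions(sys.argv, ["source_bucket", "source_key"])` from `awsglue.utils`. The SNS topic also needs an access policy that lets S3 publish to it, and the Lambda role needs permission to call `glue:StartJobRun`.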