
AWS Glue job consuming data from external REST API

I'm trying to create a workflow where an AWS Glue ETL job will pull JSON data from an external REST API instead of S3 or any other AWS-internal source. Is that even possible? Has anyone done it? Please help!

Yes, I do extract data from REST APIs like Twitter, FullStory, Elasticsearch, etc. I usually use Python Shell jobs for the extraction because they are faster (relatively small cold start). When the extraction finishes, it triggers a Spark-type job that reads only the JSON items I need. I use the requests Python library.
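A minimal sketch of that extraction step with the requests library. The endpoint URL, query parameters, and field names below are placeholders, not a real API; keep_fields mirrors the "reads only the json items I need" idea:

```python
import json

import requests  # third-party: pip install requests


def fetch_page(url, params=None, timeout=10):
    """GET one page of results and return the parsed JSON body."""
    resp = requests.get(url, params=params, timeout=timeout)
    resp.raise_for_status()  # fail fast on 4xx/5xx responses
    return resp.json()


def keep_fields(items, fields):
    """Keep only the JSON keys the downstream Spark job needs."""
    return [{k: item[k] for k in fields if k in item} for item in items]


if __name__ == "__main__":
    # A real run would call something like:
    #   page = fetch_page("https://api.example.com/v1/tweets",
    #                     params={"page": 1, "per_page": 100})
    # Here we use a literal sample so the sketch runs offline.
    page = {"results": [{"id": 1, "text": "hello", "lang": "en"}]}
    tweets = keep_fields(page["results"], ["id", "text"])
    print(json.dumps(tweets))
```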

In order to save the data into S3 you can do something like this:

import json

import boto3

# Initialize the S3 resource
s3 = boto3.resource('s3')

tweets = []
# ... code that extracts tweets from the API and appends them to `tweets` ...

tweets_json = json.dumps(tweets)
obj = s3.Object("my-tweets", "tweets.json")
obj.put(Body=tweets_json)

The AWS Glue Python Shell executor has a limit of 1 DPU max. If that's an issue, as it was in my case, a solution could be to run the script as a task in ECS.

You can run about 150 requests/second using libraries like asyncio and aiohttp in Python. example 1, example 2.
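A sketch of that fan-out pattern with asyncio. The network call is replaced by a stand-in coroutine so the example runs offline; with aiohttp you would open a ClientSession and use session.get(url) inside fetch_one instead of the sleep. The URL scheme and concurrency cap are illustrative assumptions:

```python
import asyncio

MAX_CONCURRENCY = 150  # rough ceiling suggested above; tune per API


async def fetch_one(url, sem):
    """Stand-in for an aiohttp GET; swap the sleep for session.get(url)."""
    async with sem:  # cap the number of in-flight requests
        await asyncio.sleep(0.01)  # simulate network latency
        return {"url": url, "status": 200}


async def fetch_all(urls):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    return await asyncio.gather(*(fetch_one(u, sem) for u in urls))


if __name__ == "__main__":
    urls = [f"https://api.example.com/item/{i}" for i in range(300)]
    results = asyncio.run(fetch_all(urls))
    print(len(results))
```

The semaphore is what keeps you under the API's limits: gather launches every coroutine, but at most MAX_CONCURRENCY of them are past the `async with sem` line at any moment.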

Then you can distribute your requests across multiple ECS tasks or Kubernetes pods using Ray. Here you can find a few examples of what Ray can do for you.

This also allows you to cater for APIs with rate limiting.
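One simple way to respect a per-second rate limit, sketched with only the standard library (the class name and the 50 req/s figure are illustrative, not from any particular API):

```python
import time


class RateLimiter:
    """Spacing-based limiter: allow at most `rate` calls per second."""

    def __init__(self, rate):
        self.min_interval = 1.0 / rate
        self.last = 0.0

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = time.monotonic()
        delay = self.min_interval - (now - self.last)
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()


if __name__ == "__main__":
    limiter = RateLimiter(rate=50)  # e.g. the API allows 50 req/s
    start = time.monotonic()
    for _ in range(10):
        limiter.wait()
        # call the API here
    print(f"10 calls took {time.monotonic() - start:.2f}s")
```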

Once you've gathered all the data you need, run it through AWS Glue.

Yes, it is possible. You can use AWS Glue to extract data from REST APIs. Although there is no direct connector available for Glue to connect to the internet, you can set up a VPC with a public and a private subnet. In the private subnet, you can create an ENI that allows only outbound connections, which Glue uses to fetch data from the API. In the public subnet, you can install a NAT gateway.
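A rough AWS CLI sketch of that network layout. Every ID, name, and CIDR below is a placeholder; substitute the values returned by each command, and note that route tables and subnet associations also need to be wired up for your account:

```shell
# Sketch only -- vpc-xxxx, subnet ids, eipalloc-xxxx, nat-xxxx, rtb-private
# are placeholders for the ids each command returns.
aws ec2 create-vpc --cidr-block 10.0.0.0/16
aws ec2 create-subnet --vpc-id vpc-xxxx --cidr-block 10.0.1.0/24   # public subnet
aws ec2 create-subnet --vpc-id vpc-xxxx --cidr-block 10.0.2.0/24   # private subnet (Glue ENIs land here)
aws ec2 allocate-address --domain vpc                              # Elastic IP for the NAT gateway
aws ec2 create-nat-gateway --subnet-id subnet-public --allocation-id eipalloc-xxxx
# Route the private subnet's internet-bound traffic through the NAT gateway
# so Glue can reach the external API.
aws ec2 create-route --route-table-id rtb-private \
    --destination-cidr-block 0.0.0.0/0 --nat-gateway-id nat-xxxx
```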

Additionally, you might also need to set up a security group to limit inbound connections. Hope this answers your question.

A new option since the original answer was accepted is to not use Glue at all, but to build a custom connector for Amazon AppFlow.

I would argue that AppFlow is the AWS tool most suited to data transfer between API-based data sources, while Glue is more intended for ODP-based discovery of data already in AWS.
