
Run an ETL Python script in AWS triggered by S3

I am new to AWS and don't know how to do the following. When I put an object in S3, I want to launch a Python script that applies some transformations and writes the result to another path in S3. I tried a Lambda function, but the process takes more than 300 seconds. I also tried a Glue job, but I don't know how to trigger it when the file is put in S3.

Does anyone know how to do this? Maybe I'm using the wrong AWS tools.

One option would be to use SQS:

  1. Create the SQS queue.
  2. Set up S3 to send notifications to the SQS queue when new objects are added to the source bucket. See Configuring Amazon S3 Event Notifications.
  3. Set up your Python script on an EC2 instance and have the code listen to the SQS queue.
  4. Upload the output of your Python script to the target S3 bucket once the script has finished.
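
Steps 3 and 4 above could be sketched roughly as follows. This is only a minimal sketch, assuming S3 publishes its notifications directly to the queue (not through SNS). The `transform` function, the queue URL, and the bucket name are hypothetical placeholders, and boto3 is imported inside `main` so the helper functions can be exercised without AWS access.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_message(body):
    """Return (bucket, key) pairs found in one S3 event notification body."""
    event = json.loads(body)
    pairs = []
    # S3 sends a test event without "Records" when the notification is first set up.
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def transform(data):
    """Hypothetical placeholder for the real ETL logic."""
    return data.upper()


def poll_once(sqs, s3, queue_url, target_bucket):
    """Receive one batch of messages, process each object, then delete the message."""
    resp = sqs.receive_message(
        QueueUrl=queue_url, MaxNumberOfMessages=10, WaitTimeSeconds=20
    )
    processed = []
    for msg in resp.get("Messages", []):
        for bucket, key in s3_objects_from_message(msg["Body"]):
            obj = s3.get_object(Bucket=bucket, Key=key)
            result = transform(obj["Body"].read())
            s3.put_object(Bucket=target_bucket, Key=key, Body=result)
            processed.append(key)
        # Delete only after processing succeeded, so a failed message is retried.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return processed


def main():
    import boto3  # imported here so the helpers above need no AWS setup

    queue_url = "https://sqs.us-east-1.amazonaws.com/123456789012/etl-queue"  # hypothetical
    target_bucket = "my-target-bucket"  # hypothetical
    sqs = boto3.client("sqs")
    s3 = boto3.client("s3")
    while True:
        poll_once(sqs, s3, queue_url, target_bucket)
```

Calling `main()` on the EC2 instance starts the loop. Long polling (`WaitTimeSeconds=20`) keeps the script from issuing a constant stream of empty requests while the queue is idle.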

Can you break up the Python processing into smaller steps? If you can get your code to run within the Lambda restrictions, I'd definitely recommend using Lambda instead of managing EC2.

Here is a simple solution to your problem. You mentioned that you already have an AWS Glue job that performs this operation, and that the only thing you don't know is how to trigger the Glue job when a file is placed in S3, so I'll answer that question. You can write an AWS Lambda function using the boto3 module, have it triggered by the S3 event, and call glue.start_job_run from it:

import boto3
client = boto3.client('glue')  # Glue client used to start the job
response = client.start_job_run(
    JobName='string')

https://boto3.readthedocs.io/en/latest/reference/services/glue.html#Glue.Client.start_job_run
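
A fuller version of such a Lambda function might look like the sketch below. It is a minimal illustration, not a definitive implementation. The job name `my-etl-job` and the argument names `--source_bucket` and `--source_key` are assumptions chosen for the example. The handler assumes the standard S3 event shape and starts one job run per object.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "my-etl-job"  # hypothetical Glue job name


def start_glue_runs(event, glue, job_name=GLUE_JOB_NAME):
    """Start one Glue job run per object in an S3 event and return the run ids."""
    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded (a space becomes "+").
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue.start_job_run(
            JobName=job_name,
            # Job arguments are passed as strings; these argument names are made up.
            Arguments={"--source_bucket": bucket, "--source_key": key},
        )
        run_ids.append(response["JobRunId"])
    return run_ids


def lambda_handler(event, context):
    import boto3  # provided by the Lambda Python runtime

    return {"job_run_ids": start_glue_runs(event, boto3.client("glue"))}
```

Because `start_job_run` returns as soon as the run is queued, the Lambda function finishes in well under its timeout and the long-running work happens in Glue. Inside the Glue script, the two arguments can be read with `getResolvedOptions` from `awsglue.utils`. The Lambda execution role also needs permission to call `glue:StartJobRun`.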

Note: I strongly believe that Glue, rather than Lambda, is the right tool for the requirement described in the question, because AWS Lambda has a timeout limit: at the time of writing, a function is stopped after 300 seconds.
