How to run a Lambda function for multiple files in AWS S3
I have the following Lambda function that runs my script in AWS Glue whenever a new object lands in an S3 bucket.
    import json
    import boto3
    from urllib.parse import unquote_plus

    def lambda_handler(event, context):
        # Pull the bucket and object key out of the S3 event notification
        bucketName = event["Records"][0]["s3"]["bucket"]["name"]
        fileNameFull = event["Records"][0]["s3"]["object"]["key"]
        fileName = unquote_plus(fileNameFull)
        print(bucketName, fileName)

        # Start a Glue job run for this single object
        glue = boto3.client('glue')
        response = glue.start_job_run(
            JobName='My_Job_Glue',
            Arguments={
                '--s3_target_path_key': fileName,
                '--s3_target_path_bucket': bucketName
            }
        )
        return {
            'statusCode': 200,
            'body': json.dumps('Hello from Lambda!')
        }
At first this works well, and it partially does what I need. In practice, though, there is always more than one new file for the same bucket, and the current logic starts a Glue job run for every single object (if I have 3 files, I get 3 Glue job runs). How can I improve the function so that my script runs only once all of the new data has been identified? Today I have Kafka Connect configured to batch 5000 records, and if that batch is not filled within a few minutes it flushes however many records it has.
S3 can deliver event notifications to a Simple Queue Service (SQS) queue instead of invoking Lambda directly. With SQS as the Lambda trigger, messages are delivered in batches, so a single invocation can handle many S3 object events at once.
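As a minimal sketch of that approach (not the poster's code): a handler wired to an SQS queue that receives S3 event notifications, collecting all object keys in the batch and starting one Glue job run for the whole batch. The job name and argument names mirror the question; the queue wiring and the comma-joined key list argument are assumptions.

    import json
    import boto3
    from urllib.parse import unquote_plus

    glue = boto3.client('glue')

    def lambda_handler(event, context):
        # With an SQS trigger, each record's body is one S3 event notification
        # (as a JSON string). Collect every object key in the batch.
        keys = []
        bucket = None
        for sqs_record in event['Records']:
            body = json.loads(sqs_record['body'])
            # S3 test events have no 'Records' key, hence .get()
            for s3_record in body.get('Records', []):
                bucket = s3_record['s3']['bucket']['name']
                keys.append(unquote_plus(s3_record['s3']['object']['key']))

        if keys:
            # One Glue run per batch instead of one per file. Glue job
            # arguments must be strings, so the keys are joined here
            # ('--s3_target_path_keys' is a hypothetical plural argument).
            glue.start_job_run(
                JobName='My_Job_Glue',
                Arguments={
                    '--s3_target_path_bucket': bucket,
                    '--s3_target_path_keys': ','.join(keys)
                }
            )

        return {
            'statusCode': 200,
            'body': json.dumps(f'Started Glue run for {len(keys)} objects')
        }

How many events land in one invocation is controlled by the event source mapping's batch size and batching window, for example (ARN and function name are placeholders):

    boto3.client('lambda').create_event_source_mapping(
        EventSourceArn='arn:aws:sqs:us-east-1:123456789012:my-queue',
        FunctionName='my-batching-function',
        BatchSize=100,
        MaximumBatchingWindowInSeconds=60
    )

This should roughly match the Kafka Connect behavior described in the question: the function fires once the batch fills, or after the window elapses with however many messages have arrived.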