I am considering moving to lambdas and after spending some time reading docs and various blogs with user experiences I am still struggling with a simple question. Is there a proposed/proper way to use lambda with existing s3 files?
I have an s3 bucket that contains archived data spanning a couple of years. The size of these data is rather large (hundreds of GB). Each file is a simple txt file. Each line in the file represents an event and it's just a comma separated string.
My endgame is to consume these files, parse each one of them line by line apply some transformation, create batches of lines and send them to an external service. From what I've read so far, if I write a proper lambda, this will be triggered by an s3 event (for example an upload of a new file).
Is there a way to apply the lambda to all the existing contents of my bucket?
Thanks
For existing resources you would need to write a script that gets a listing of all your resources and sends each item to a Lambda function somehow. I'd probably look into sending the location of each of your existing S3 objects to a Kenesis stream and configure a Lambda function to pull records from that stream and process them.
Try to copy your bucket content and catch create events with lambda.
copy:
s3cmd sync s3://from/this/bucket/ s3://to/this/bucket
for larger buckets:
Try using s3cmd.
s3cmd modify --recursive --add-header="touched:touched" s3://path/to/s3/bucket-or-folder
This will modify metadata and invoke an event for lambda
I had a similar problem I solved it with minimal changes to my existing Lambda function. The solution involves creating API Gateway trigger (in addition to S3 trigger) - the API gateway trigger is used to process historical files in S3 & the regular S3 trigger will processes my files as new files are uploaded to my S3 bucket.
Initially - I started by building my function to expect a S3 event as trigger. Recall that the S3 events have this structure - so I would look for the S3 bucket name and key to process - like so:
for record in event['Records']:
bucket = record['s3']['bucket']['name']
key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
temp_dir = tempfile.TemporaryDirectory()
video_filename = os.path.basename(key)
local_video_filename = os.path.join(temp_dir.name, video_filename)
s3_client.download_file(bucket, key, local_video_filename)
But when you send the API Gateway trigger there is no "Records" object in the request/event. You can use query parameters in the API Gateway Trigger - so the modification required to the above snippet of code is:
if 'Records' in event:
# this means we are working off of an S3 event
records_to_process = event['Records']
else:
# this is for ad-hoc posts via API Gateway trigger for Lambda
records_to_process = [{
"s3":{"bucket": {"name": event["queryStringParameters"]["bucket"]},
"object":{"key": event["queryStringParameters"]["file"]}}
}]
for record in records_to_process:
# below lines of code s same as the earlier snippet of code
bucket = record['s3']['bucket']['name']
key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')
temp_dir = tempfile.TemporaryDirectory()
video_filename = os.path.basename(key)
local_video_filename = os.path.join(temp_dir.name, video_filename)
s3_client.download_file(bucket, key, local_video_filename)
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.