AWS Lambda S3 events for existing files

I am considering moving to Lambdas, and after spending some time reading the docs and various blogs with user experiences, I am still struggling with a simple question: is there a proposed/proper way to use Lambda with existing S3 files?

I have an S3 bucket that contains archived data spanning a couple of years. The data is rather large (hundreds of GB). Each file is a simple txt file, and each line in the file represents an event as a comma-separated string.

My endgame is to consume these files, parse each one of them line by line, apply some transformation, create batches of lines, and send them to an external service. From what I've read so far, if I write a proper Lambda, it will be triggered by an S3 event (for example, the upload of a new file).
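
For concreteness, the kind of per-file processing I have in mind looks roughly like this; it's only a sketch, and the transformation, batch size, and external endpoint below are placeholders:

    import boto3
    import requests

    s3 = boto3.client('s3')
    BATCH_SIZE = 500                                          # placeholder batch size
    ENDPOINT = 'https://external-service.example.com/events'  # placeholder endpoint

    def transform(fields):
        return fields  # placeholder for the real per-line transformation

    def process_object(bucket, key):
        # stream the object and batch its comma-separated lines
        body = s3.get_object(Bucket=bucket, Key=key)['Body']
        batch = []
        for raw_line in body.iter_lines():
            fields = raw_line.decode('utf-8').split(',')
            batch.append(transform(fields))
            if len(batch) >= BATCH_SIZE:
                requests.post(ENDPOINT, json=batch)
                batch = []
        if batch:
            requests.post(ENDPOINT, json=batch)  # flush the final partial batch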

Is there a way to apply the Lambda to all the existing contents of my bucket?

Thanks

For existing resources you would need to write a script that gets a listing of all your objects and sends each item to a Lambda function somehow. I'd probably look into sending the location of each of your existing S3 objects to a Kinesis stream and configuring a Lambda function to pull records from that stream and process them.
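
A rough sketch of such a backfill script with boto3 (the bucket name, stream name, and record format are just placeholders):

    import json
    import boto3

    s3 = boto3.client('s3')
    kinesis = boto3.client('kinesis')

    BUCKET = 'my-archive-bucket'        # placeholder bucket name
    STREAM = 'existing-files-stream'    # placeholder Kinesis stream name

    # list every existing object and push its location onto the stream;
    # a Lambda subscribed to the stream then fetches and processes each object
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get('Contents', []):
            kinesis.put_record(
                StreamName=STREAM,
                Data=json.dumps({'bucket': BUCKET, 'key': obj['Key']}).encode('utf-8'),
                PartitionKey=obj['Key'],
            )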

Try copying your bucket content and catching the create events with Lambda.

copy:

s3cmd sync s3://from/this/bucket/ s3://to/this/bucket

for larger buckets:

https://github.com/paultuckey/s3_bucket_to_bucket_copy_py
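
If you prefer a script over s3cmd, a boto3 sketch of the same copy approach could look like this (bucket names are placeholders; the copy is server-side, and each copied object fires an ObjectCreated event in the destination bucket):

    import boto3

    s3 = boto3.resource('s3')
    SRC = 'from-this-bucket'   # placeholder source bucket
    DST = 'to-this-bucket'     # placeholder destination bucket with the Lambda trigger

    for obj in s3.Bucket(SRC).objects.all():
        # server-side copy; the new object in DST triggers the Lambda
        s3.Bucket(DST).copy({'Bucket': SRC, 'Key': obj.key}, obj.key)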

Try using s3cmd.

s3cmd modify --recursive --add-header="touched:touched" s3://path/to/s3/bucket-or-folder

This will modify the objects' metadata, which triggers an S3 create event for the Lambda.
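
The same "touch" trick can be scripted with boto3 if you don't use s3cmd (the bucket name is a placeholder): copying each object onto itself with replaced metadata rewrites it and emits an ObjectCreated:Copy event.

    import boto3

    s3 = boto3.client('s3')
    BUCKET = 'bucket-or-folder'   # placeholder bucket name

    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get('Contents', []):
            # copy the object onto itself with new metadata to trigger the event
            s3.copy_object(
                Bucket=BUCKET,
                Key=obj['Key'],
                CopySource={'Bucket': BUCKET, 'Key': obj['Key']},
                Metadata={'touched': 'touched'},
                MetadataDirective='REPLACE',
            )

Note that copy_object only handles objects up to 5 GB; larger objects need a multipart copy.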

I had a similar problem and solved it with minimal changes to my existing Lambda function. The solution involves creating an API Gateway trigger (in addition to the S3 trigger): the API Gateway trigger is used to process historical files in S3, while the regular S3 trigger processes files as new files are uploaded to my S3 bucket.

Initially, I built my function to expect an S3 event as the trigger. Recall that S3 events have this structure, so I would look for the S3 bucket name and key to process, like so:

    # requires module-level imports: os, tempfile, boto3, urllib.parse.unquote_plus
    # and a module-level client: s3_client = boto3.client('s3')
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')

        # download the object into a temporary directory for processing
        temp_dir = tempfile.TemporaryDirectory()
        video_filename = os.path.basename(key)
        local_video_filename = os.path.join(temp_dir.name, video_filename)
        s3_client.download_file(bucket, key, local_video_filename)

But when the function is invoked via the API Gateway trigger, there is no "Records" object in the event. You can pass the bucket and key as query parameters in the API Gateway request, so the modification required to the above snippet of code is:

    if 'Records' in event:
        # this means we are working off of an S3 event
        records_to_process = event['Records']
    else:
        # this is for ad-hoc posts via the API Gateway trigger for Lambda;
        # build a record that mimics the S3 event structure
        records_to_process = [{
            "s3": {"bucket": {"name": event["queryStringParameters"]["bucket"]},
                   "object": {"key": event["queryStringParameters"]["file"]}}
        }]

    for record in records_to_process:
        # the lines below are the same as in the earlier snippet
        bucket = record['s3']['bucket']['name']
        key = unquote_plus(record['s3']['object']['key'], encoding='utf-8')

        temp_dir = tempfile.TemporaryDirectory()
        video_filename = os.path.basename(key)
        local_video_filename = os.path.join(temp_dir.name, video_filename)
        s3_client.download_file(bucket, key, local_video_filename)

[Screenshot: Postman result of sending the POST request]
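
To actually drive the historical backfill, a small script can list the archived objects and POST each one to the API Gateway endpoint; the URL and bucket name below are placeholders:

    import boto3
    import requests

    API_URL = 'https://example.execute-api.us-east-1.amazonaws.com/prod/process'  # placeholder
    BUCKET = 'my-archive-bucket'                                                   # placeholder

    s3 = boto3.client('s3')
    paginator = s3.get_paginator('list_objects_v2')
    for page in paginator.paginate(Bucket=BUCKET):
        for obj in page.get('Contents', []):
            # matches the queryStringParameters the modified handler expects
            requests.post(API_URL, params={'bucket': BUCKET, 'file': obj['Key']})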
