
AWS Lambda (Python) Fails to unzip and store files in S3

The project currently maintains an S3 bucket which holds a large zip file (1.5 GB) containing .xpt and .sas7bdat files. The unzipped size is 20 GB.

I am trying to unzip the file and push the same folder structure back to S3.

The following code works for small zip files but fails for the large zip file (1.5 GB):

import io
import zipfile

import boto3

s3 = boto3.resource('s3')
client = boto3.client('s3')
bucket = s3.Bucket('my-zip-bucket')
putObjects = []

for obj in bucket.objects.all():
    obj = client.get_object(Bucket='my-zip-bucket', Key=obj.key)

    # Reads the entire zip into memory at once
    with io.BytesIO(obj["Body"].read()) as tf:
        tf.seek(0)  # rewind the buffer

        with zipfile.ZipFile(tf, mode='r') as zipf:
            for file in zipf.infolist():
                fileName = file.filename
                putFile = client.put_object(Bucket='my-un-zip-bucket-', Key=fileName, Body=zipf.read(file))
                putObjects.append(putFile)

Error : Memory Size: 3008 MB Max Memory Used: 3008 MB

I would like to validate:

  1. Is AWS Lambda not a suitable solution for large files?
  2. Should I use different libraries / a different approach rather than reading everything into memory?

There is a serverless solution using AWS Glue! (I nearly died figuring this out)

This solution is two parts:

  1. A lambda function that is triggered by S3 upon upload of a ZIP file and creates a GlueJobRun - passing the S3 Object key as an argument to Glue.
  2. A Glue Job that unzips files (in memory!) and uploads back to S3.

See my code below which unzips the ZIP file and places the contents back into the same bucket (configurable).

Please upvote if helpful :)

Lambda Script (python3) that calls a Glue Job called YourGlueJob

import boto3
import urllib.parse

glue = boto3.client('glue')

def lambda_handler(event, context):
    # Bucket and object key of the ZIP file that triggered this invocation
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    print(key)

    try:
        # Hand the heavy lifting off to Glue, passing the ZIP's location as job arguments
        newJobRun = glue.start_job_run(
            JobName='YourGlueJob',
            Arguments={
                '--bucket': bucket,
                '--key': key,
            }
        )
        print("Successfully created unzip job")
        return key
    except Exception as e:
        print(e)
        print('Error starting unzip job for ' + key)
        raise e
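For completeness, the S3 trigger for part 1 can be wired up programmatically along these lines. This is a sketch only: the bucket name, function ARN, and statement id are placeholders, not part of the original answer.

import boto3

s3 = boto3.client("s3")
lam = boto3.client("lambda")

bucket = "my-zip-bucket"                                                            # placeholder
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:StartUnzipGlueJob"   # placeholder

# Allow S3 to invoke the Lambda function.
lam.add_permission(
    FunctionName=function_arn,
    StatementId="s3-invoke-unzip",
    Action="lambda:InvokeFunction",
    Principal="s3.amazonaws.com",
    SourceArn=f"arn:aws:s3:::{bucket}",
)

# Fire the function only when a .zip object is created.
s3.put_bucket_notification_configuration(
    Bucket=bucket,
    NotificationConfiguration={
        "LambdaFunctionConfigurations": [{
            "LambdaFunctionArn": function_arn,
            "Events": ["s3:ObjectCreated:*"],
            "Filter": {"Key": {"FilterRules": [{"Name": "suffix", "Value": ".zip"}]}},
        }]
    },
)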

AWS Glue Job Script to unzip the files

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME','bucket','key'],)

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

import boto3
import zipfile
import io
from contextlib import closing

s3 = boto3.client('s3')
s3r = boto3.resource('s3')

bucket = args["bucket"]
key = args["key"]

obj = s3r.Object(
    bucket_name=bucket,
    key=key
)

# NB: this still reads the whole archive into memory -- a Glue worker simply has far
# more memory available than Lambda, which is why it copes with the 1.5 GB ZIP.
buffer = io.BytesIO(obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
names = z.namelist()
for member in names:
    print(member)
    y = z.open(member)
    # Prefix each extracted file with the original ZIP key to preserve the folder structure
    arcname = key + member
    x = io.BytesIO(y.read())
    s3.upload_fileobj(x, bucket, arcname)
    y.close()
print(names)


job.commit()

As described in the AWS Lambda Limits documentation:

But there are limits that AWS Lambda imposes that include, for example, the size of your deployment package or the amount of memory your Lambda function is allocated per invocation.

Here, the issue you are having is with the "amount of memory your Lambda function is allocated per invocation". Unfortunately, Lambda is not a workable solution for this case; you need to go with the EC2 approach.
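For what it's worth, a minimal sketch of that EC2 approach, assuming an instance (or container) with enough free disk for both the 1.5 GB archive and its extracted members; the bucket names and local path are placeholders:

import os
import zipfile

import boto3

s3 = boto3.client("s3")
src_bucket, zip_key = "my-zip-bucket", "data/archive.zip"   # placeholders
dst_bucket = "my-un-zip-bucket"
local_zip = "/data/archive.zip"                             # placeholder path with ~22 GB free

# Download once to local disk instead of into memory.
s3.download_file(src_bucket, zip_key, local_zip)

with zipfile.ZipFile(local_zip) as zf:
    for info in zf.infolist():
        if info.is_dir():
            continue
        # Extract one member at a time and upload it, preserving the folder structure.
        with zf.open(info) as member:
            s3.upload_fileobj(member, dst_bucket, info.filename)

os.remove(local_zip)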

When your overall memory requirement is high, I don't think Lambda is a great solution. I am not sure how the specified file types work, but in general, reading/processing large files uses a chunked approach to avoid large memory requirements. Whether a chunked approach works or not depends on your business requirements.
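For illustration, a minimal sketch of what "chunked" means here, using boto3's streaming body; the bucket, key, and per-chunk processing are hypothetical, and whether your real processing can be done piecewise depends on the file format:

import hashlib

import boto3

s3 = boto3.client("s3")
body = s3.get_object(Bucket="my-zip-bucket", Key="data/big-file.xpt")["Body"]  # hypothetical object

digest = hashlib.md5()
# iter_chunks() yields the object in fixed-size pieces, so memory use stays flat
# no matter how large the object is.
for chunk in body.iter_chunks(chunk_size=8 * 1024 * 1024):
    digest.update(chunk)  # stand-in for real per-chunk processing

print("md5:", digest.hexdigest())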

Kudos to @Ganondorfz for the serverless solution.

I tried a similar thing, also using a Go Lambda for unzipping, and thought it might be worth noting what wasn't initially very clear to me when I started looking into this.

To answer the questions:

  1. Is AWS Lambda not a suitable solution for large files?

Not for zip file unzipping. Zip is an archive format that keeps the file index at the end, and AFAICT all utilities and libraries expect to seek to given file locations inside it, so they are limited by Lambda's disk and memory constraints. I guess something could be written to jump to ranges within the S3 object, but this would be a pretty complex solution - I haven't seen utilities for this (though I could be wrong), and it's much simpler to use an EC2 instance or container with appropriate resources to achieve the unzipping.
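For instance, a tiny sketch of the "jump to ranges" idea (bucket and key are placeholders, and this is nowhere near a full unzipper): a ranged GET can fetch just the tail of the archive, where the zip central directory lives, without downloading the whole object.

import boto3

s3 = boto3.client("s3")
bucket, key = "my-zip-bucket", "data/archive.zip"   # placeholders

size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
# Fetch only the last 64 KB of the object, where the central directory usually sits.
tail = s3.get_object(Bucket=bucket, Key=key, Range=f"bytes={max(0, size - 65536)}-")["Body"].read()
# The End Of Central Directory record starts with the signature PK\x05\x06.
eocd_offset = tail.rfind(b"PK\x05\x06")
print("EOCD found at tail offset", eocd_offset)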

It is, however, possible to stream gzip files, and therefore Lambda can be used for large-file decompression in that case.
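As a rough sketch of that gzip case (bucket names and keys are hypothetical): wrapping the S3 streaming body in gzip.GzipFile decompresses lazily, and upload_fileobj reads from it in chunks as a multipart upload, so the full decompressed size never has to fit in the Lambda's memory.

import gzip

import boto3

s3 = boto3.client("s3")
src_bucket, src_key = "my-gzip-bucket", "large-file.csv.gz"   # hypothetical source
dst_bucket, dst_key = "my-unzipped-bucket", "large-file.csv"  # hypothetical destination

# get_object returns a StreamingBody; GzipFile decompresses it lazily as it is read,
# so only a small buffer is held in memory at any time.
body = s3.get_object(Bucket=src_bucket, Key=src_key)["Body"]
with gzip.GzipFile(fileobj=body, mode="rb") as gz:
    # upload_fileobj pulls from the file-like object in chunks and performs a
    # multipart upload of the decompressed data.
    s3.upload_fileobj(gz, dst_bucket, dst_key)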

It is also possible to do the reverse of this use case: stream files read from S3 into a zip that is written back to S3.
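A rough sketch of that reverse direction follows. It leans on the third-party smart_open library (5.x API) for a writable S3 stream, and the bucket names and keys are hypothetical, so treat it as an outline rather than a drop-in solution.

import zipfile

import boto3
from smart_open import open as s3_open  # third-party: pip install "smart_open[s3]"

s3 = boto3.client("s3")
src_bucket = "my-un-zip-bucket"                      # hypothetical source bucket
keys_to_bundle = ["data/a.xpt", "data/b.sas7bdat"]   # hypothetical object keys
dst_uri = "s3://my-zip-bucket/rebundled.zip"         # hypothetical destination

# smart_open provides a writable, non-seekable file-like object backed by a
# multipart upload; ZipFile can write to such streams since Python 3.6.
with s3_open(dst_uri, "wb", transport_params={"client": s3}) as out:
    with zipfile.ZipFile(out, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for key in keys_to_bundle:
            body = s3.get_object(Bucket=src_bucket, Key=key)["Body"]
            # force_zip64=True because the member size is not known up front
            with zf.open(key, "w", force_zip64=True) as dest:
                for chunk in body.iter_chunks(chunk_size=16 * 1024 * 1024):
                    dest.write(chunk)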

  2. Should I use different libraries / a different approach rather than reading everything into memory?

I had more success / saw better resource utilisation with a Go runtime but, as above, I don't believe Lambda by itself will work for this use case.

Ref also: https://dev.to/flowup/using-io-reader-io-writer-in-go-to-stream-data-3i7b
