
How to extract files from a zip archive in S3

I have a zip archive uploaded to S3 at a certain location (say /foo/bar.zip). I would like to extract the files within bar.zip and place them under /foo, without downloading or re-uploading the extracted files. How can I do this, so that S3 is treated pretty much like a file system?

S3 isn't really designed to allow this; normally you would have to download the file, process it and upload the extracted files.

However, there may be a few options:

  1. You could mount the S3 bucket as a local filesystem using s3fs and FUSE (see the article and GitHub site). This still requires the files to be downloaded and uploaded, but it hides these operations away behind a filesystem interface.

  2. If your main concern is to avoid downloading data out of AWS to your local machine, then of course you could download the data onto a remote EC2 instance and do the work there, with or without s3fs. This keeps the data within Amazon data centers.

  3. You may be able to perform remote operations on the files, without downloading them onto your local machine, using AWS Lambda.

You would need to create, package and upload a small program written in Node.js to access, decompress and upload the files. This processing takes place on AWS infrastructure behind the scenes, so you won't need to download any files to your own machine. See the FAQs.

Finally, you need to find a way to trigger this code - typically, in Lambda, this would be triggered automatically by the upload of the zip file to S3. If the file is already there, you may need to trigger it manually, via the invoke-async command provided by the AWS API. See the AWS Lambda walkthroughs and API docs.
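
For reference, here is a minimal sketch of triggering such a function manually with boto3, assuming a hypothetical function name; the payload mimics the S3 event the function would receive from a real trigger, and InvocationType="Event" is the asynchronous mode that supersedes invoke-async:

import json
import boto3

lambda_client = boto3.client("lambda")

# Fake S3-style event pointing at the archive that is already in the bucket
# ("my-bucket" is a placeholder).
payload = {
    "Records": [
        {"s3": {"bucket": {"name": "my-bucket"}, "object": {"key": "foo/bar.zip"}}}
    ]
}

# Invoke asynchronously, just as an S3 trigger would
# ("my-unzip-function" is a placeholder name).
lambda_client.invoke(
    FunctionName="my-unzip-function",
    InvocationType="Event",
    Payload=json.dumps(payload).encode("utf-8"),
)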

However, this is quite an elaborate way of avoiding downloads, and probably only worth it if you need to process large numbers of zip files! Note also that (as of Oct 2018) Lambda functions are limited to a maximum duration of 15 minutes (the default timeout is 3 seconds), so they may run out of time if your files are extremely large - but since scratch space in /tmp is limited to 512 MB, your file size is also limited.

If keeping the data in AWS is the goal, you can use AWS Lambda to:

  1. Connect to S3 (I connect the Lambda function via a trigger from S3)
  2. Copy the data from S3
  3. Open the archive and decompress it (No need to write to disk)
  4. Do something with the data

If the function is initiated via a trigger, Lambda will suggest that you place the contents in a separate S3 location to avoid looping by accident. To open the archive, process it, and then return the contents, you can do something like the following.

import csv, json
import os
import urllib.parse
import boto3
from zipfile import ZipFile
import io

s3 = boto3.client("s3")

def extract_zip(input_zip):
    # Read the full StreamingBody into memory and open it as a zip archive.
    contents = input_zip.read()
    input_zip = ZipFile(io.BytesIO(contents))
    # Return a {member name: member bytes} dict for every file in the archive.
    return {name: input_zip.read(name) for name in input_zip.namelist()}

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    # Get the object from the event and show its content type
    bucket = event["Records"][0]["s3"]["bucket"]["name"]
    key = urllib.parse.unquote_plus(
        event["Records"][0]["s3"]["object"]["key"], encoding="utf-8"
    )
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        # This example assumes the file to process shares the archive's name
        file_name = key.split(".")[0] + ".csv"
        print(f"Attempting to open {key} and read {file_name}")
        print("CONTENT TYPE: " + response["ContentType"])
        data = []
        contents = extract_zip(response["Body"])
        for k, v in contents.items():
            print(v)
            reader = csv.reader(io.StringIO(v.decode('utf-8')), delimiter=',')
            for row in reader:
                data.append(row)
        return {
            "statusCode": 200,
            "body": data
        }

    except Exception as e:
        print(e)
        print(
            "Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.".format(
                key, bucket
            )
        )
        raise e

The code above accesses the file contents through response['Body'], where response is the result of the s3.get_object call made with the bucket and key taken from the S3 event. The response body is a StreamingBody, a file-like object with a few convenience methods. Use its read() method, passing an amt argument if you are processing large files or files of unknown size. Working on an archive in memory requires a few extra steps: wrap the contents in a BytesIO object and open it with the standard library's ZipFile (see the zipfile documentation). Once the data is passed to ZipFile, you can call read() on each member. What you do from here depends on your specific use case; if the archives contain more than one file, you will need logic for handling each one. My example assumes you have one or a few small CSV files to process, and it returns a dictionary with the file name as the key and the file contents as the value.
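
For instance, if the archive is large you could pull the body down in fixed-size chunks rather than with a single read() call. A rough sketch, reusing the response object from the handler above (note that ZipFile still needs the whole archive in a seekable buffer, so this avoids one huge read call but not the memory use):

# Read the StreamingBody in 1 MiB chunks into an in-memory buffer.
buffer = io.BytesIO()
while True:
    chunk = response["Body"].read(amt=1024 * 1024)
    if not chunk:
        break
    buffer.write(chunk)
buffer.seek(0)
archive = ZipFile(buffer)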

I have included the next step of reading the CSV files and returning the data and a status code 200 in the response. Keep in mind, your needs may be different. This example wraps the data in a StringIO object and uses a CSV reader to handle the data. Once the result is passed via the response, the Lambda function can hand off the processing to another AWS process.
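
If the end goal is to land the extracted files back in S3, as in the original question, the final step could upload each member instead of returning it. A minimal sketch, assuming a hypothetical output bucket and prefix (kept separate from the input prefix so the uploads do not re-trigger the function):

OUTPUT_BUCKET = "my-output-bucket"  # placeholder
OUTPUT_PREFIX = "foo/extracted/"    # placeholder

def upload_extracted(contents):
    # contents is the {file name: bytes} dict returned by extract_zip().
    for name, body in contents.items():
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=OUTPUT_PREFIX + name, Body=body)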

The following is an example of reading files inside a zip archive using s3fs. Let s3_file_path be the target file path on S3:

import s3fs
from zipfile import ZipFile
import io

s3_file_path = '...'
fs = s3fs.S3FileSystem(anon=False)
input_zip = ZipFile(io.BytesIO(fs.cat(s3_file_path)))

encoding = 'ISO-8859-1'  # or 'utf-8'
for name in input_zip.namelist():
    data = input_zip.read(name).decode(encoding)
    print("filename: " + name)
    print("sample data: " + data[0:100])

You will need to adjust the encoding for different kinds of files.

You can use AWS Lambda for this. You can write Python code that uses boto3 to connect to S3, read the archive into a buffer, and unzip it using these libraries:

import zipfile
import io

import boto3

# zipped_file is the S3 object holding the archive (bucket and key are placeholders)
zipped_file = boto3.resource("s3").Object("my-bucket", "foo/bar.zip")

buffer = io.BytesIO(zipped_file.get()["Body"].read())
zipped = zipfile.ZipFile(buffer)
for file in zipped.namelist():
    # process each member, e.g. contents = zipped.read(file)
    ...

There is also a tutorial here: https://betterprogramming.pub/unzip-and-gzip-incoming-s3-files-with-aws-lambda-f7bccf0099c9

I faced a similar problem and solved it using the AWS SDK for Java. You will still download and re-upload the files to S3, but the key is to "stream" the content, without keeping any data in memory or writing to disk.

I've made a library that can be used for this purpose; it is available on Maven Central, and here is the GitHub link: nejckorasa/s3-stream-unzip.

Unzipping is achieved without keeping data in memory or writing to disk. That makes it suitable for large data files - it has been used to unzip files of size 100GB+.

You can integrate it in your Lambda or anywhere with access to S3.
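
For comparison, here is a rough Python sketch of the same streaming idea using only boto3 and zipfile - this is not the library's API, and unlike the library it still buffers the compressed archive in memory, but each extracted member is streamed straight into S3 without being written to disk (bucket and keys are placeholders):

import io
import zipfile
import boto3

s3 = boto3.client("s3")

bucket = "my-bucket"      # placeholder
src_key = "foo/bar.zip"   # placeholder
dst_prefix = "foo/"       # placeholder

# ZipFile needs a seekable object, so the compressed archive is buffered in memory.
body = s3.get_object(Bucket=bucket, Key=src_key)["Body"].read()
with zipfile.ZipFile(io.BytesIO(body)) as zf:
    for name in zf.namelist():
        if name.endswith("/"):  # skip directory entries
            continue
        # Stream the decompressed member straight to S3; nothing touches disk.
        with zf.open(name) as member:
            s3.upload_fileobj(member, bucket, dst_prefix + name)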
