
Trying to split a large TSV file on S3 w/ Lambda

The Goal

I have a data-generating process which creates one large TSV file on S3 (somewhere between 30 and 40 GB in size). Because of some data processing I want to do on it, it's easier to have it in many smaller files (~1 GB or smaller). Unfortunately I don't have much ability to change the original data-generating process to partition the files at creation, so I'm trying to write a simple Lambda function to do it for me. My attempt is below:

import boto3
import codecs

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    read_bucket_name = 'some-bucket'
    write_bucket_name = 'some-other-bucket'

    original_key = 'some-s3-file-key'
    obj = s3.get_object(Bucket=read_bucket_name, Key=original_key)

    lines = []
    line_count = 0
    file_count = 0
    MAX_LINE_COUNT = 500000

    def create_split_name(file_count):
        return f'{original_key}-{file_count}'

    def create_body(lines):
        # Lines read from the stream keep their trailing newlines
        return ''.join(lines)

    # Read the object line by line instead of loading it all at once
    for ln in codecs.getreader('utf-8')(obj['Body']):
        if line_count >= MAX_LINE_COUNT:
            # Flush the current batch of lines to its own object
            key = create_split_name(file_count)

            s3.put_object(
                Bucket=write_bucket_name,
                Key=key,
                Body=create_body(lines)
            )

            lines = []
            line_count = 0
            file_count += 1

        lines.append(ln)
        line_count += 1

    # Flush whatever is left over after the loop
    if len(lines) > 0:
        key = create_split_name(file_count)
        s3.put_object(
            Bucket=write_bucket_name,
            Key=key,
            Body=create_body(lines)
        )
        file_count += 1

    return {
        'statusCode': 200,
        'body': { 'file_count': file_count }
    }

This works functionally, which is great, but the problem is that on sufficiently large files it can't finish within the 15-minute run window of an AWS Lambda. So my questions are these:

  1. Can this code be optimized in any appreciable way to reduce run time (I'm not an expert on profiling lambda code)?
  2. Will porting this to a compiled language provide any real benefit to run time?
  3. Are there other utilities within AWS that can solve this problem? (A quick note here, I know that I could spin up an EC2 server to do this for me but ideally I'm trying to find a serverless solution)

UPDATE: Another option I have tried is not to split up the file, but to tell different Lambda jobs to simply read different parts of the same file using Range.

I can try to read a file by doing

obj = s3.get_object(Bucket='cradle-smorgasbord-drop', Key=key, Range=bytes_range)
lines = [line for line in codecs.getreader('utf-8')(obj['Body'])]

However, on an approximately 30 GB file I had bytes_range=0-49999999, which is only the first 50 MB, and the download is taking way longer than I would think it should for that amount of data (I actually haven't even seen it finish yet).

To avoid hitting the limit of 15 minutes for the execution of AWS Lambda functions you have to ensure that you only read as much data from S3 as you can process in 15 minutes or less.

How much data from S3 you can process in 15 minutes or less depends on your function logic and on the CPU and network performance of the AWS Lambda function. The available CPU performance of an AWS Lambda function scales with the memory provided to it. From the AWS Lambda documentation:

Lambda allocates CPU power linearly in proportion to the amount of memory configured. At 1,792 MB, a function has the equivalent of one full vCPU (one vCPU-second of credits per second).

So as a first step you could try to increase the provided memory to see if that improves the amount of data your function can process in 15 minutes.
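If you want to script that change rather than use the console, the boto3 Lambda API exposes it as well. A minimal sketch, assuming the function is called split-tsv (a placeholder name, not something from the question):

import boto3

lambda_client = boto3.client('lambda')

# 'split-tsv' is a placeholder for the real function name
lambda_client.update_function_configuration(
    FunctionName='split-tsv',
    MemorySize=3008  # MB; more memory also means a larger share of CPU
)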

Increasing the CPU performance of your AWS Lambda function might already solve your problem for now, but it doesn't scale well in case you have to process even larger files in the future.

Fortunately there is a solution for that: when reading objects from S3 you don't have to read the whole object at once; you can use range requests to read only a part of the object. To do that, all you have to do is specify the range you want to read when calling get_object(). From the boto3 documentation for get_object():

Range (string) -- Downloads the specified range bytes of an object. For more information about the HTTP Range header, go to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.

In your case, instead of triggering your AWS Lambda function once per object in S3, you'd trigger it multiple times for the same object, but to process different chunks of that object. Depending on how you invoke your function, you might need another AWS Lambda function that examines the size of the object to process (using head_object()) and triggers your actual Lambda function once for each chunk of data.
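A dispatcher of that kind could look roughly like the following sketch. The worker function name (split-tsv-worker), the event shape, and the 1 GB chunk size are assumptions for illustration, not something prescribed by AWS:

import json
import boto3

CHUNK_SIZE = 1024 ** 3  # ~1 GB per chunk (assumption)

def dispatch_handler(event, context):
    s3 = boto3.client('s3')
    lambda_client = boto3.client('lambda')

    bucket = event['bucket']
    key = event['key']

    # head_object() returns the object's metadata, including its size in bytes
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

    start = 0
    while start < size:
        end = min(start + CHUNK_SIZE, size) - 1
        # Asynchronously invoke the worker once per chunk
        lambda_client.invoke(
            FunctionName='split-tsv-worker',  # hypothetical worker function
            InvocationType='Event',
            Payload=json.dumps({
                'bucket': bucket,
                'key': key,
                'range': f'bytes={start}-{end}',
                'part': start // CHUNK_SIZE
            })
        )
        start = end + 1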

While you need that additional chunking logic, you wouldn't need to split the data you read in your original AWS Lambda function anymore, as you could simply ensure that each chunk has a size of 1 GB and that only the data belonging to that chunk is read, thanks to the range request. As you'd invoke a separate AWS Lambda function for each chunk, you'd also parallelize your currently sequential logic, resulting in faster execution.
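A worker along those lines could look roughly like this, using the same assumed event shape as the dispatcher sketch above. Note that for a TSV a row can straddle a chunk boundary, so splitting on exact byte offsets would still need some handling of partial first and last lines:

import boto3

def worker_handler(event, context):
    s3 = boto3.client('s3')

    # Event shape produced by the hypothetical dispatcher above
    bucket = event['bucket']
    key = event['key']
    byte_range = event['range']  # e.g. 'bytes=0-1073741823'
    part = event['part']

    # Only the requested byte range is downloaded, not the whole 30-40 GB object
    obj = s3.get_object(Bucket=bucket, Key=key, Range=byte_range)

    # Write the chunk under its own key; rows that straddle chunk
    # boundaries would still need to be stitched together separately
    s3.put_object(
        Bucket='some-other-bucket',
        Key=f'{key}-{part}',
        Body=obj['Body'].read()
    )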

Finally, you could drastically decrease the amount of memory consumed by your AWS Lambda function by not reading all the data into memory, but using streaming instead.
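For instance, the StreamingBody returned by get_object() is a file-like object, so it can be handed to upload_fileobj(); the chunk is then streamed from the source object into the target object without ever being held in memory as one large string. Again a sketch under the same assumed event shape as above:

import boto3

def streaming_worker_handler(event, context):
    s3 = boto3.client('s3')

    # get_object() returns a lazily-read StreamingBody instead of the full payload
    obj = s3.get_object(
        Bucket=event['bucket'],
        Key=event['key'],
        Range=event['range']
    )

    # upload_fileobj() reads from the file-like object in parts
    # (multipart upload), so memory usage stays low and roughly constant
    s3.upload_fileobj(
        obj['Body'],
        'some-other-bucket',
        f"{event['key']}-{event['part']}"
    )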
