
Trying to split a large TSV file on S3 w/ Lambda

The goal: I have a data-generating process which creates one large TSV file on S3 (somewhere between 30 and 40 GB in size). Because of some data processing I want to do on it, it's easier to have it in many smaller files (~1 GB in size or smaller). Unfortunately I don't have much ability to change the original data-generating process to partition out the files at creation, so I'm trying to write a simple Lambda to do it for me. My attempt is below:

import json
import boto3
import codecs

def lambda_handler(event, context):
    s3 = boto3.client('s3')
    read_bucket_name = 'some-bucket'
    write_bucket_name = 'some-other-bucket'

    original_key = 'some-s3-file-key'
    obj = s3.get_object(Bucket=read_bucket_name, Key=original_key)

    lines = []
    line_count = 0
    file_count = 0
    MAX_LINE_COUNT = 500000

    def create_split_name(file_count):
        return f'{original_key}-{file_count}'

    def create_body(lines):
        return ''.join(lines)

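    # codecs.getreader wraps the streaming body so lines are decoded lazily rather than downloading the whole object up front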
    for ln in codecs.getreader('utf-8')(obj['Body']):
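        # once a split has accumulated MAX_LINE_COUNT lines, flush it to the destination bucket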
        if line_count >= MAX_LINE_COUNT:
            key = create_split_name(file_count)

            s3.put_object(
                Bucket=write_bucket_name,
                Key=key,
                Body=create_body(lines)
            )

            lines = []
            line_count = 0
            file_count += 1

        lines.append(ln)
        line_count += 1

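    # write out whatever is left over as the final, shorter split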
    if len(lines) > 0:
        key = create_split_name(file_count)
        s3.put_object(
            Bucket=write_bucket_name,
            Key=key,
            Body=create_body(lines)
        )
        file_count += 1

    return {
        'statusCode': 200,
        'body': { 'file_count': file_count }
    }

This works functionally, which is great, but the problem is that on sufficiently large files it can't finish within the 15-minute run window of an AWS Lambda. So my questions are these:

  1. Can this code be optimized in any appreciable way to reduce run time (I'm not an expert on profiling Lambda code)?
  2. Will porting this to a compiled language provide any real benefit to run time?
  3. Are there other utilities within AWS that can solve this problem? (A quick note here: I know that I could spin up an EC2 server to do this for me, but ideally I'm looking for a serverless solution.)

UPDATE: Another option I have tried is to not split up the file at all, but to tell different Lambda jobs to simply read different parts of the same file using Range.

I can try to read part of the file by doing:

obj = s3.get_object(Bucket='cradle-smorgasbord-drop', Key=key, Range=bytes_range)
lines = [line for line in codecs.getreader('utf-8')(obj['Body'])]

However, on an approximately 30 GB file I had bytes_range=0-49999999, which is only the first 50 MB, and the download is taking far longer than I would expect for that amount of data (I actually haven't even seen it finish yet).

To avoid hitting the 15-minute execution limit for AWS Lambda functions, you have to ensure that you only read as much data from S3 as you can process in 15 minutes or less.

How much data from S3 you can process in 15 minutes or less depends on your function logic and on the CPU and network performance of the AWS Lambda function. The available CPU performance of an AWS Lambda function scales with the memory provided to it. From the AWS Lambda documentation:

Lambda allocates CPU power linearly in proportion to the amount of memory configured. At 1,792 MB, a function has the equivalent of one full vCPU (one vCPU-second of credits per second).

So as a first step you could try to increase the provided memory to see if that improves the amount of data your function can process in 15 minutes.
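If the function's configuration is managed with boto3, a minimal sketch of that first step could look like the following (the function name and the 3,008 MB value are just placeholders for this example):

import boto3

lambda_client = boto3.client('lambda')

# 'tsv-splitter' is a placeholder; use the name of your deployed function.
lambda_client.update_function_configuration(
    FunctionName='tsv-splitter',
    MemorySize=3008,  # in MB; available CPU scales linearly with this value
    Timeout=900       # 15 minutes, the current maximum
)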

Increasing the CPU performance of your AWS Lambda function might already solve your problem for now, but it doesn't scale well if you have to process larger files in the future.

Fortunately there is a solution for that: when reading objects from S3 you don't have to read the whole object at once; you can use range requests to read only a part of the object. To do that, all you have to do is specify the range you want to read when calling get_object(). From the boto3 documentation for get_object():

Range (string) -- Downloads the specified range bytes of an object. For more information about the HTTP Range header, go to http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35.

In your case, instead of triggering your AWS Lambda function once per S3 object to process, you'd trigger it multiple times for the same object, but have each invocation process a different chunk of that object. Depending on how you invoke your function, you might need another AWS Lambda function to examine the size of the object to process (using head_object()) and trigger your actual Lambda function once for each chunk of data.
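A rough sketch of such a coordinating function, with placeholder bucket, key and worker-function names and an assumed chunk size of roughly 1 GB, could look like this:

import json
import boto3

s3 = boto3.client('s3')
lambda_client = boto3.client('lambda')

CHUNK_SIZE = 1024 ** 3  # roughly 1 GB per chunk (assumption)

def dispatch_handler(event, context):
    bucket = 'some-bucket'      # placeholder
    key = 'some-s3-file-key'    # placeholder

    # head_object only fetches the object's metadata, including its size in bytes
    size = s3.head_object(Bucket=bucket, Key=key)['ContentLength']

    chunks = 0
    for start in range(0, size, CHUNK_SIZE):
        end = min(start + CHUNK_SIZE, size) - 1  # HTTP byte ranges are inclusive

        # asynchronously invoke the worker function once per byte range
        lambda_client.invoke(
            FunctionName='tsv-chunk-worker',     # placeholder worker name
            InvocationType='Event',              # fire and forget
            Payload=json.dumps({
                'bucket': bucket,
                'key': key,
                'range': f'bytes={start}-{end}',
                'chunk': chunks
            })
        )
        chunks += 1

    return {'statusCode': 200, 'body': {'chunks': chunks}}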

While you need that additional chunking logic, you wouldn't need to split the read data in your original AWS Lambda function anymore, as you could simply ensure that each chunk has a size of 1 GB and that, thanks to the range request, only the data belonging to that chunk is read. As you'd invoke a separate AWS Lambda function for each chunk, you'd also parallelize your currently sequential logic, resulting in faster execution.

Finally, you could drastically decrease the amount of memory consumed by your AWS Lambda function by not reading the whole data into memory, but using streaming instead.
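As an illustration only, a worker that just copies its assigned byte range into its own smaller object could stream the response body straight back to S3 rather than collecting lines in a list first. (Keep in mind that a plain byte range can end in the middle of a TSV line, so real code would still have to deal with line fragments at the chunk boundaries; that part is omitted here.)

import boto3

s3 = boto3.client('s3')

READ_BUCKET = 'some-bucket'        # placeholders
WRITE_BUCKET = 'some-other-bucket'

def worker_handler(event, context):
    key = event['key']
    byte_range = event['range']    # e.g. 'bytes=0-1073741823'
    chunk = event['chunk']

    obj = s3.get_object(Bucket=READ_BUCKET, Key=key, Range=byte_range)

    # The StreamingBody is file-like, so upload_fileobj can push it to the
    # target object in parts without ever holding the whole chunk in memory.
    s3.upload_fileobj(obj['Body'], WRITE_BUCKET, f'{key}-{chunk}')

    return {'statusCode': 200}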
