
Lambda + awswrangler: Poor performance while handling "large" parquet files

I'm currently writing a Lambda function to read parquet files of 100MB to 200MB on average, using Python and AWS wrangler. The idea is to read the files and transform them to csv:

import urllib.parse
from io import StringIO

import awswrangler as wr
import boto3

print('Loading function')

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')
dest_bucket = "mydestbucket"

def lambda_handler(event, context):
    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])

        if key.endswith('.parquet'):
            # chunked=True yields an iterator of DataFrames instead of one big frame
            dfs = wr.s3.read_parquet(path=['s3://' + bucket + '/' + key], chunked=True, use_threads=True)

            count = 0
            for df in dfs:
                # Serialize each chunk to CSV in memory and upload it
                csv_buffer = StringIO()
                df.to_csv(csv_buffer)
                s3_resource.Object(dest_bucket, 'dfo_' + str(count) + '.csv').put(Body=csv_buffer.getvalue())
                count += 1

            return "File written"
    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise e

The function works ok when I use small files, but once I try with large files (100MB) it gives a timeout.

I already allocated 3GB of memory for Lambda and set a timeout of 10 min, however, it doesn't seem to do the trick.

Do you know how to improve the performance apart from allocating more memory?
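
For reference, here is a minimal sketch that stays with awswrangler but reads in fixed-size row chunks and writes each chunk straight back to S3 with wr.s3.to_csv, avoiding the in-memory CSV buffer. The chunk size, function name, and output key pattern below are placeholders, not part of the original code:

import awswrangler as wr

def parquet_to_csv_chunks(bucket, key, dest_bucket):
    # chunked=<int> yields DataFrames of at most that many rows (100_000 is a placeholder)
    dfs = wr.s3.read_parquet(path=['s3://' + bucket + '/' + key], chunked=100_000, use_threads=True)

    for count, df in enumerate(dfs):
        # Write each chunk directly to S3 as CSV; no intermediate in-memory CSV buffer
        wr.s3.to_csv(df=df, path=f"s3://{dest_bucket}/dfo_{count}.csv", index=False)

    return "File written"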

Thanks!

I solved this by creating a layer using fastparquet, which handles memory in a more optimized way than aws wrangler:

from io import StringIO
import urllib.parse

import boto3
import fastparquet as fp
import s3fs

dest_bucket = "mydestbucket"
s3_resource = boto3.resource('s3')


def lambda_handler(event, context):
    # Source bucket/key from the S3 event notification
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    s3_path = 's3://' + bucket + '/' + key

    # S3 filesystem initialization
    s3_fs = s3fs.S3FileSystem()
    s3fs_path = s3_fs.glob(path=s3_path)
    my_open = s3_fs.open

    # Read parquet object using fastparquet
    fp_obj = fp.ParquetFile(s3fs_path, open_with=my_open)

    # Build a pandas df from the parquet data
    new_df = fp_obj.to_pandas()

    # csv buffer to perform the parquet --> csv transformation
    csv_buffer = StringIO()
    new_df.to_csv(csv_buffer)

    # Output key (example value: same key with a .csv extension)
    file_path = key.rsplit('.', 1)[0] + '.csv'

    s3_resource.Object(
            dest_bucket,
            f"{file_path}",
        ).put(Body=csv_buffer.getvalue())

    return "File written"
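
As a possible refinement of this approach (a sketch, not part of the original answer): fastparquet can also iterate one row group at a time via ParquetFile.iter_row_groups(), so peak memory stays around a single row group rather than the whole file. The function name and output naming below are illustrative:

from io import StringIO

import boto3
import fastparquet as fp
import s3fs


def parquet_to_csv_by_row_group(s3_path, dest_bucket, out_prefix):
    # Stream the parquet file one row group at a time and write one CSV per group
    s3_fs = s3fs.S3FileSystem()
    s3_resource = boto3.resource('s3')

    fp_obj = fp.ParquetFile(s3_fs.glob(path=s3_path), open_with=s3_fs.open)

    for i, chunk in enumerate(fp_obj.iter_row_groups()):
        csv_buffer = StringIO()
        chunk.to_csv(csv_buffer)
        # e.g. out_prefix="dfo" -> dfo_0.csv, dfo_1.csv, ... (naming is a placeholder)
        s3_resource.Object(dest_bucket, f"{out_prefix}_{i}.csv").put(Body=csv_buffer.getvalue())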
