
Lambda + awswrangler: Poor performance while handling "large" parquet files

I'm currently writing a Lambda function to read parquet files of 100MB to 200MB on average, using Python and AWS wrangler. The idea is to read the files and transform them to csv:

import urllib.parse
from io import StringIO

import awswrangler as wr
import boto3

print('Loading function')

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')
dest_bucket = "mydestbucket"

def lambda_handler(event, context):
    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])

        if key.endswith('.parquet'):
            # chunked=True yields an iterator of DataFrames instead of one big frame
            dfs = wr.s3.read_parquet(path=['s3://' + bucket + '/' + key], chunked=True, use_threads=True)

            count = 0
            for df in dfs:
                # Serialize each chunk to CSV in memory and upload it
                csv_buffer = StringIO()
                df.to_csv(csv_buffer)
                s3_resource.Object(dest_bucket, 'dfo_' + str(count) + '.csv').put(Body=csv_buffer.getvalue())
                count += 1

            return "File written"
    except Exception as e:
        print(e)
        print('Error processing object {} from bucket {}.'.format(key, bucket))
        raise e

The function works ok when I use small files, but once I try with large files (100MB) it gives a timeout.

I already allocated 3GB of memory for Lambda and set a timeout of 10 min, however, it doesn't seem to do the trick.

Do you know how to improve the performance apart from allocating more memory?
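
For reference, here is a minimal sketch that stays with awswrangler but reads in fixed-size row chunks and writes each chunk straight back to S3 with wr.s3.to_csv, avoiding the in-memory CSV buffer. The chunk size, function name, and output key pattern below are placeholders, not part of the original code:

import awswrangler as wr

def parquet_to_csv_chunks(bucket, key, dest_bucket):
    # chunked=<int> yields DataFrames of at most that many rows (100_000 is a placeholder)
    dfs = wr.s3.read_parquet(path=['s3://' + bucket + '/' + key], chunked=100_000, use_threads=True)

    for count, df in enumerate(dfs):
        # Write each chunk directly to S3 as CSV; no intermediate in-memory CSV buffer
        wr.s3.to_csv(df=df, path=f"s3://{dest_bucket}/dfo_{count}.csv", index=False)

    return "File written"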

Thanks!

I solved this by creating a layer using fastparquet, which handles memory in a more optimized way than aws wrangler:

from io import StringIO
import urllib.parse

import boto3
import fastparquet as fp
import s3fs

dest_bucket = "mydestbucket"
s3_resource = boto3.resource('s3')


def lambda_handler(event, context):
    # Source bucket/key from the S3 event notification
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    s3_path = 's3://' + bucket + '/' + key

    # S3 filesystem initialization
    s3_fs = s3fs.S3FileSystem()
    s3fs_path = s3_fs.glob(path=s3_path)
    my_open = s3_fs.open

    # Read parquet object using fastparquet
    fp_obj = fp.ParquetFile(s3fs_path, open_with=my_open)

    # Build a pandas df from the parquet data
    new_df = fp_obj.to_pandas()

    # csv buffer to perform the parquet --> csv transformation
    csv_buffer = StringIO()
    new_df.to_csv(csv_buffer)

    # Output key (example value: same key with a .csv extension)
    file_path = key.rsplit('.', 1)[0] + '.csv'

    s3_resource.Object(
            dest_bucket,
            f"{file_path}",
        ).put(Body=csv_buffer.getvalue())

    return "File written"
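
As a possible refinement of this approach (a sketch, not part of the original answer): fastparquet can also iterate one row group at a time via ParquetFile.iter_row_groups(), so peak memory stays around a single row group rather than the whole file. The function name and output naming below are illustrative:

from io import StringIO

import boto3
import fastparquet as fp
import s3fs


def parquet_to_csv_by_row_group(s3_path, dest_bucket, out_prefix):
    # Stream the parquet file one row group at a time and write one CSV per group
    s3_fs = s3fs.S3FileSystem()
    s3_resource = boto3.resource('s3')

    fp_obj = fp.ParquetFile(s3_fs.glob(path=s3_path), open_with=s3_fs.open)

    for i, chunk in enumerate(fp_obj.iter_row_groups()):
        csv_buffer = StringIO()
        chunk.to_csv(csv_buffer)
        # e.g. out_prefix="dfo" -> dfo_0.csv, dfo_1.csv, ... (naming is a placeholder)
        s3_resource.Object(dest_bucket, f"{out_prefix}_{i}.csv").put(Body=csv_buffer.getvalue())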
