Lambda + awswrangler: Poor performance while handling "large" parquet files
I'm currently writing a Lambda function to read parquet files of 100MB to 200MB on average, using Python and awswrangler. The idea is to read the files and transform them to CSV:
import urllib.parse
from io import StringIO

import awswrangler as wr
import boto3

print('Loading function')

s3 = boto3.client('s3')
s3_resource = boto3.resource('s3')
dest_bucket = "mydestbucket"

def lambda_handler(event, context):
    # Get the object from the event and show its content type
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'], encoding='utf-8')
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        print("CONTENT TYPE: " + response['ContentType'])
        if key.endswith('.parquet'):
            # Read the parquet file in chunks to keep memory bounded
            dfs = wr.s3.read_parquet(path=['s3://' + bucket + '/' + key], chunked=True, use_threads=True)
            count = 0
            for df in dfs:
                # Serialize each chunk to CSV and upload it as its own object
                csv_buffer = StringIO()
                df.to_csv(csv_buffer)
                s3_resource.Object(dest_bucket, 'dfo_' + str(count) + '.csv').put(Body=csv_buffer.getvalue())
                count += 1
        return "File written"
    except Exception as e:
        print(e)
        raise
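For context, the handler only reads two fields from the incoming S3 put-notification event. A minimal sketch of the event shape it assumes (bucket and key names here are illustrative; real events carry many more fields):

```python
# Minimal shape of the S3 notification event the handler parses
event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "my-source-bucket"},
                "object": {"key": "data/file.parquet"},
            }
        }
    ]
}

# The same lookups the handler performs
bucket = event["Records"][0]["s3"]["bucket"]["name"]
key = event["Records"][0]["s3"]["object"]["key"]
```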
The function works fine with small files, but once I try it with large files (100MB) it times out.
I already allocated 3GB of memory to the Lambda and set a timeout of 10 minutes; however, that doesn't seem to do the trick.
Do you know how to improve the performance, apart from allocating more memory?
Thanks!
I solved this by creating a layer with fastparquet, which handles memory in a more optimized way than awswrangler:
from io import StringIO

import boto3
import fastparquet as fp
import s3fs

# s3_path, dest_bucket and file_path are derived from the Lambda
# event in the full function (omitted here)
s3_resource = boto3.resource('s3')

# S3 filesystem initialization
s3_fs = s3fs.S3FileSystem()
s3fs_path = s3_fs.glob(path=s3_path)

# Read the parquet object using fastparquet
fp_obj = fp.ParquetFile(s3fs_path, open_with=s3_fs.open)

# Filter columns and build a pandas df
new_df = fp_obj.to_pandas()

# csv buffer to perform the parquet --> csv transformation
csv_buffer = StringIO()
new_df.to_csv(csv_buffer)
s3_resource.Object(dest_bucket, file_path).put(Body=csv_buffer.getvalue())
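If memory is still tight, the chunking idea from the question can be combined with this: serialize the frame in slices so that only one slice's CSV text is held in memory at a time, instead of the whole file's. A minimal, S3-free sketch of that pattern with plain pandas (names are illustrative):

```python
from io import StringIO

import pandas as pd

def frame_to_csv_parts(df: pd.DataFrame, rows_per_part: int):
    """Yield (part_index, csv_text) pairs, one slice at a time.

    Only one slice's CSV is ever materialized, mirroring the
    per-chunk uploads in the Lambda handler above.
    """
    for count, start in enumerate(range(0, len(df), rows_per_part)):
        buf = StringIO()
        df.iloc[start:start + rows_per_part].to_csv(buf, index=False)
        yield count, buf.getvalue()

# Example: 10 rows split into parts of 4 rows each -> 3 parts
df = pd.DataFrame({"a": range(10), "b": range(10)})
parts = list(frame_to_csv_parts(df, rows_per_part=4))
```

In the Lambda, each yielded `csv_text` would be uploaded as its own object (as in the question's loop) rather than collected into a list.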