How to read large gzip csv files from Azure storage in memory in AWS Lambda?
I'm trying to transform large gzip CSV files (>3 GB) from an Azure Storage blob by loading them into a pandas DataFrame in an AWS Lambda function.
A local machine with 16 GB of RAM can process the files, but the same code fails with "errorType": "MemoryError" when executed in AWS Lambda, since Lambda's maximum memory is 3 GB.
Is there any way to read and process such large data in memory in AWS Lambda? Just to mention, I tried a streaming approach as well, but with no luck.
Below is my sample code:
from io import StringIO
import gzip

import pandas as pd
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
    container_name=SOURCE_CONTAINER,
    blob_name=blob_name,
    credential=SOURCE_ACCT_KEY,
)

# Download the whole blob, decompress it, and load it into a DataFrame --
# this is where the MemoryError is raised.
data = blob.download_blob().readall()
unzipped_bytes = gzip.decompress(data)
unzipped_str = unzipped_bytes.decode("utf-8")
df = pd.read_csv(StringIO(unzipped_str), usecols=req_columns, low_memory=False)
Try using chunks and read the data n rows at a time:
for chunk in pd.read_csv("large.csv", chunksize=10_000):
    print(chunk)
I'm not sure about compressed files, though. In the worst case you can decompress the data first and then read it in chunks.
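To tie the two parts together, below is a rough, untested sketch of how the blob could be streamed, decompressed on the fly, and read in chunks, so that neither the full gzip file nor the full CSV ever sits in Lambda memory. It reuses SOURCE_ACCOUNT, SOURCE_CONTAINER, blob_name, SOURCE_ACCT_KEY and req_columns from the question; ChunkStream and process are made-up names for illustration. The idea is that StorageStreamDownloader.chunks() yields the blob piece by piece, a small file-like wrapper feeds those pieces to gzip, and pd.read_csv consumes the decompressed stream with chunksize:

import gzip
import io

import pandas as pd
from azure.storage.blob import BlobClient

class ChunkStream(io.RawIOBase):
    """Read-only file-like wrapper around StorageStreamDownloader.chunks()."""

    def __init__(self, downloader):
        self._chunks = downloader.chunks()
        self._buffer = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the internal buffer from the next network chunk when it runs dry.
        while not self._buffer:
            try:
                self._buffer = next(self._chunks)
            except StopIteration:
                return 0  # end of blob
        n = min(len(b), len(self._buffer))
        b[:n] = self._buffer[:n]
        self._buffer = self._buffer[n:]
        return n

blob = BlobClient(
    account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
    container_name=SOURCE_CONTAINER,
    blob_name=blob_name,
    credential=SOURCE_ACCT_KEY,
)

downloader = blob.download_blob()        # no readall(): keep the data on the wire
raw = io.BufferedReader(ChunkStream(downloader))
with gzip.GzipFile(fileobj=raw) as gz:   # decompress on the fly
    for chunk in pd.read_csv(gz, usecols=req_columns, chunksize=100_000):
        process(chunk)                   # your per-chunk transformation goes here

Only one network chunk plus one DataFrame chunk needs to be resident at a time, so whether this fits in Lambda's 3 GB depends mainly on the chosen chunksize and on how much state the per-chunk transformation accumulates.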