

How to read large gzip csv files from azure storage in memory in aws lambda?

I'm trying to transform large gzip csv files (>3 GB) from an Azure storage blob by loading them into a pandas dataframe in an AWS Lambda function.

My local machine with 16 GB of RAM is able to process the files, but when executing in AWS Lambda it gives a memory error ("errorType": "MemoryError"), since Lambda's maximum memory is 3 GB.

Is there any way of reading and processing that much data in memory in AWS Lambda? Just to mention, I tried a streaming approach as well, but no luck.

Below is my sample code:

from azure.storage.blob import BlobClient
from io import StringIO
import pandas as pd
import gzip

blob = BlobClient(account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
                  container_name=SOURCE_CONTAINER,
                  blob_name=blob_name,
                  credential=SOURCE_ACCT_KEY)

# Download the whole blob into memory, then decompress and parse it.
data = blob.download_blob()
data = data.readall()

unzipped_bytes = gzip.decompress(data)
unzipped_str = unzipped_bytes.decode("utf-8")
df = pd.read_csv(StringIO(unzipped_str), usecols=req_columns, low_memory=False)

Try using chunks and reading the data n rows at a time:

for chunk in pd.read_csv("large.csv", chunksize=10_000):
    print(chunk)

Although, I'm not sure about compressed files. In the worst case you can uncompress the data and then read it in chunks.
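For the compressed case, one possibility (a minimal sketch, not verified against this exact setup) is to stream the blob through gzip and let pandas consume it in chunks, so the full CSV never sits in memory at once. It assumes azure-storage-blob 12.x, where StorageStreamDownloader.chunks() yields the blob as an iterator of byte chunks; ChunkStream and process() below are hypothetical helpers, not part of any library:

import gzip
import io
import pandas as pd
from azure.storage.blob import BlobClient

blob = BlobClient(account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
                  container_name=SOURCE_CONTAINER,
                  blob_name=blob_name,
                  credential=SOURCE_ACCT_KEY)

downloader = blob.download_blob()

# Minimal file-like wrapper around the downloader's chunk iterator, so gzip
# (and pandas) can pull bytes on demand instead of reading the whole blob up front.
class ChunkStream(io.RawIOBase):
    def __init__(self, chunks):
        self._chunks = chunks
        self._buffer = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the buffer until it can satisfy the request or the blob ends.
        while len(self._buffer) < len(b):
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        n = min(len(b), len(self._buffer))
        b[:n] = self._buffer[:n]
        self._buffer = self._buffer[n:]
        return n

raw = io.BufferedReader(ChunkStream(downloader.chunks()))
with gzip.GzipFile(fileobj=raw) as gz:
    # Each chunk is a small DataFrame; only one chunk lives in memory at a time.
    for chunk in pd.read_csv(gz, usecols=req_columns, chunksize=100_000):
        process(chunk)  # placeholder for whatever per-chunk work is needed

Whether this fits in a 3 GB Lambda still depends on what is done with each chunk; keeping only aggregates (or writing each processed chunk back out) is what actually bounds the memory.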
