How to read large gzip csv files from Azure storage in memory in AWS Lambda?
I'm trying to transform large gzip CSV files (>3 GB) from an Azure Storage blob by loading them into a pandas DataFrame in an AWS Lambda function.
A local machine with 16 GB of RAM can process the files, but the same code fails with "errorType": "MemoryError" when executed in AWS Lambda, since Lambda's maximum memory is 3 GB.
Is there any way to read and process such large data in memory in AWS Lambda? Just to mention, I tried a streaming approach as well, but with no luck.
Below is my sample code:
from io import StringIO
import gzip

import pandas as pd
from azure.storage.blob import BlobClient

blob = BlobClient(
    account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
    container_name=SOURCE_CONTAINER,
    blob_name=blob_name,
    credential=SOURCE_ACCT_KEY,
)

# Download the whole blob, decompress it, and load it into a DataFrame --
# this is where the MemoryError is raised.
data = blob.download_blob().readall()
unzipped_bytes = gzip.decompress(data)
unzipped_str = unzipped_bytes.decode("utf-8")
df = pd.read_csv(StringIO(unzipped_str), usecols=req_columns, low_memory=False)
Try using chunks and read the data n rows at a time:
for chunk in pd.read_csv("large.csv", chunksize=10_000):
    print(chunk)
I'm not sure about compressed files, though. In the worst case you can decompress the data first and then read it in chunks.
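To tie the two parts together, below is a rough, untested sketch of how the blob could be streamed, decompressed on the fly, and read in chunks, so that neither the full gzip file nor the full CSV ever sits in Lambda memory. It reuses SOURCE_ACCOUNT, SOURCE_CONTAINER, blob_name, SOURCE_ACCT_KEY and req_columns from the question; ChunkStream and process are made-up names for illustration. The idea is that StorageStreamDownloader.chunks() yields the blob piece by piece, a small file-like wrapper feeds those pieces to gzip, and pd.read_csv consumes the decompressed stream with chunksize:

import gzip
import io

import pandas as pd
from azure.storage.blob import BlobClient

class ChunkStream(io.RawIOBase):
    """Read-only file-like wrapper around StorageStreamDownloader.chunks()."""

    def __init__(self, downloader):
        self._chunks = downloader.chunks()
        self._buffer = b""

    def readable(self):
        return True

    def readinto(self, b):
        # Refill the internal buffer from the next network chunk when it runs dry.
        while not self._buffer:
            try:
                self._buffer = next(self._chunks)
            except StopIteration:
                return 0  # end of blob
        n = min(len(b), len(self._buffer))
        b[:n] = self._buffer[:n]
        self._buffer = self._buffer[n:]
        return n

blob = BlobClient(
    account_url="https://" + SOURCE_ACCOUNT + ".blob.core.windows.net",
    container_name=SOURCE_CONTAINER,
    blob_name=blob_name,
    credential=SOURCE_ACCT_KEY,
)

downloader = blob.download_blob()        # no readall(): keep the data on the wire
raw = io.BufferedReader(ChunkStream(downloader))
with gzip.GzipFile(fileobj=raw) as gz:   # decompress on the fly
    for chunk in pd.read_csv(gz, usecols=req_columns, chunksize=100_000):
        process(chunk)                   # your per-chunk transformation goes here

Only one network chunk plus one DataFrame chunk needs to be resident at a time, so whether this fits in Lambda's 3 GB depends mainly on the chosen chunksize and on how much state the per-chunk transformation accumulates.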