
Memory Error: While reading a large .txt file from BLOB in python

I am trying to read a large (~1.5 GB) .txt file from an Azure blob in Python, which raises a MemoryError. Is there a way to read this file efficiently?

Below is the code that I am trying to run:

from azure.storage.blob import BlockBlobService
import pandas as pd
from io import StringIO
import time

STORAGEACCOUNTNAME= '*********'
STORAGEACCOUNTKEY= "********"

CONTAINERNAME= '******'
BLOBNAME= 'path/to/blob'

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

start = time.time()
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content

df = pd.read_csv(StringIO(blobstring))
end = time.time()

print("Time taken = ",end-start)

Below are the last few lines of the error:

---> 16 blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
     17 
     18 #df = pd.read_csv(StringIO(blobstring))

~/anaconda3_420/lib/python3.5/site-packages/azure/storage/blob/baseblobservice.py in get_blob_to_text(self, container_name, blob_name, encoding, snapshot, start_range, end_range, validate_content, progress_callback, max_connections, lease_id, if_modified_since, if_unmodified_since, if_match, if_none_match, timeout)
   2378                                       if_none_match,
   2379                                       timeout)
-> 2380         blob.content = blob.content.decode(encoding)
   2381         return blob
   2382 

MemoryError:

How can I read a file of size ~1.5 GB from a Blob container in Python? I would also like my code to have an optimal runtime.

Assuming there is enough memory on your machine, and according to the pandas.read_csv API reference below, you can read the CSV blob content directly into a pandas dataframe via the CSV blob URL with a SAS token.

[Screenshot: pandas.read_csv API reference]

Here is my sample code as a reference for you.

from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta

import pandas as pd

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_name = '<your csv blob name>'

url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"

service = BaseBlobService(account_name=account_name, account_key=account_key)
# Generate the sas token for your csv blob
token = service.generate_blob_shared_access_signature(
    container_name,
    blob_name,
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1)
)

# Directly read the csv blob content into dataframe by the url with sas token
df = pd.read_csv(f"{url}?{token}")
print(df)

I think this avoids copying the content in memory several times, as happens when the whole text content is read and then converted to a file-like buffer object.
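If the resulting dataframe itself is still too large to hold in memory at once, here is a minimal alternative sketch (assuming the same azure-storage-blob 2.x SDK as above; local_path and the chunksize value are placeholders to adjust): download the blob to a local file with get_blob_to_path, then read it in chunks with pandas.

from azure.storage.blob import BlockBlobService
import pandas as pd

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_name = '<your csv blob name>'
local_path = 'blob_data.txt'  # hypothetical local path for the downloaded blob

blob_service = BlockBlobService(account_name=account_name, account_key=account_key)

# Stream the blob to a local file instead of decoding the whole content in memory
blob_service.get_blob_to_path(container_name, blob_name, local_path)

# Read the downloaded file in chunks so only one chunk is held in memory at a time
for chunk in pd.read_csv(local_path, chunksize=1000000):
    # process each chunk here (filter, aggregate, append partial results, ...)
    print(chunk.shape)

Each chunk is an ordinary dataframe, so you can filter or aggregate per chunk and combine the partial results afterwards.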

Hope it helps.
