
Memory Error: While reading a large .txt file from BLOB in python

I am trying to read a large (~1.5 GB) .txt file from an Azure blob in Python, which raises a MemoryError. Is there a way to read this file efficiently?

Below is the code that I am trying to run:

from azure.storage.blob import BlockBlobService
import pandas as pd
from io import StringIO
import time

STORAGEACCOUNTNAME= '*********'
STORAGEACCOUNTKEY= "********"

CONTAINERNAME= '******'
BLOBNAME= 'path/to/blob'

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)

start = time.time()
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content

df = pd.read_csv(StringIO(blobstring))
end = time.time()

print("Time taken = ",end-start)

Below are the last few lines of the error:

---> 16 blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME)
     17 
     18 #df = pd.read_csv(StringIO(blobstring))

~/anaconda3_420/lib/python3.5/site-packages/azure/storage/blob/baseblobservice.py in get_blob_to_text(self, container_name, blob_name, encoding, snapshot, start_range, end_range, validate_content, progress_callback, max_connections, lease_id, if_modified_since, if_unmodified_since, if_match, if_none_match, timeout)
   2378                                       if_none_match,
   2379                                       timeout)
-> 2380         blob.content = blob.content.decode(encoding)
   2381         return blob
   2382 

MemoryError:

How can I read a file of size ~1.5 GB from a Blob container in Python? I would also like my code to have an optimal runtime.

Assuming there is enough memory on your machine, and according to the pandas.read_csv API reference below, you can read the CSV blob content directly into a pandas dataframe via the CSV blob URL with a SAS token.

[Screenshot: pandas.read_csv API reference]

Here is my sample code as a reference for you.

from azure.storage.blob.baseblobservice import BaseBlobService
from azure.storage.blob import BlobPermissions
from datetime import datetime, timedelta

import pandas as pd

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_name = '<your csv blob name>'

url = f"https://{account_name}.blob.core.windows.net/{container_name}/{blob_name}"

service = BaseBlobService(account_name=account_name, account_key=account_key)
# Generate the sas token for your csv blob
token = service.generate_blob_shared_access_signature(
    container_name,
    blob_name,
    permission=BlobPermissions.READ,
    expiry=datetime.utcnow() + timedelta(hours=1)
)

# Directly read the csv blob content into dataframe by the url with sas token
df = pd.read_csv(f"{url}?{token}")
print(df)

I think this avoids copying the content in memory several times, as happens when the whole text content is read and then converted to a file-like buffer object.
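If the resulting dataframe itself is still too large to hold in memory at once, here is a minimal alternative sketch (assuming the same azure-storage-blob 2.x SDK as above; local_path and the chunksize value are placeholders to adjust): download the blob to a local file with get_blob_to_path, then read it in chunks with pandas.

from azure.storage.blob import BlockBlobService
import pandas as pd

account_name = '<your storage account name>'
account_key = '<your storage account key>'
container_name = '<your container name>'
blob_name = '<your csv blob name>'
local_path = 'blob_data.txt'  # hypothetical local path for the downloaded blob

blob_service = BlockBlobService(account_name=account_name, account_key=account_key)

# Stream the blob to a local file instead of decoding the whole content in memory
blob_service.get_blob_to_path(container_name, blob_name, local_path)

# Read the downloaded file in chunks so only one chunk is held in memory at a time
for chunk in pd.read_csv(local_path, chunksize=1000000):
    # process each chunk here (filter, aggregate, append partial results, ...)
    print(chunk.shape)

Each chunk is an ordinary dataframe, so you can filter or aggregate per chunk and combine the partial results afterwards.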

Hope it helps.
