Load CSV stored as an Azure Blob directly into a Pandas data frame without saving to disk first

The article "Explore data in Azure blob storage with pandas" shows how to load data from Azure Blob storage into a Pandas data frame.

They do it by first downloading the blob and storing it locally as a CSV file and then loading that CSV file into a data frame.

import pandas as pd
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
# Download the blob to a local file first...
blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
# ...then read that same local file into a data frame
dataframe_blobdata = pd.read_csv(LOCALFILENAME)

Is there a way to pull the blob directly into a data frame without saving it to local disk first?

You could try something like this, using StringIO:

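import pandas as pd
from azure.storage.blob import BlockBlobService
from io import StringIO

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
# get_blob_to_text returns a Blob object; the decoded text is in its .content attribute
blob_string = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME).content
# Wrap the text in a file-like buffer so read_csv can parse it without touching disk
dataframe_blobdata = pd.read_csv(StringIO(blob_string))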
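
If you are on the newer azure-storage-blob v12 package, where BlockBlobService no longer exists, the same idea might look roughly like the sketch below. This is not from the linked article: the helper names csv_bytes_to_dataframe and load_blob_csv, the connection string parameter and the sample data are all made up for illustration, and the Azure import is deferred so the parsing part runs without the SDK or any credentials.

```python
from io import BytesIO

import pandas as pd


def csv_bytes_to_dataframe(payload: bytes, **read_csv_kwargs) -> pd.DataFrame:
    """Parse CSV bytes held in memory; no temporary file is written."""
    return pd.read_csv(BytesIO(payload), **read_csv_kwargs)


def load_blob_csv(connection_string: str, container: str, blob: str) -> pd.DataFrame:
    """Download a blob with the v12 client and parse it in memory.

    Assumption: azure-storage-blob >= 12 is installed. It is imported
    lazily so the rest of this sketch runs without it.
    """
    from azure.storage.blob import BlobClient

    client = BlobClient.from_connection_string(
        connection_string, container_name=container, blob_name=blob
    )
    payload = client.download_blob().readall()  # whole blob as bytes
    return csv_bytes_to_dataframe(payload)


# Hypothetical stand-in for the downloaded blob content.
sample = b"id,name\n1,alpha\n2,beta\n"
df = csv_bytes_to_dataframe(sample)
print(df)
```

BytesIO is used here instead of StringIO because readall() hands back bytes; pandas takes care of decoding them.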

Be aware that the whole file is held in memory, so a large blob can cause a MemoryError. Once the data is in the data frame, you could try del blob_string to release the text copy, although I haven't checked how much that helps.
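
If memory is tight, one option (a sketch, not something I have measured on real blobs) is to let pandas parse the buffer in chunks and keep only what you need from each chunk. The function name sum_column_in_chunks and the sample data are hypothetical. Note that the downloaded text itself is still in memory; chunking only avoids building one large data frame on top of it.

```python
from io import StringIO

import pandas as pd


def sum_column_in_chunks(csv_text: str, column: str, chunksize: int = 2):
    """Sum one column without materialising the full data frame."""
    total = 0
    # With chunksize set, read_csv yields data frames of at most
    # `chunksize` rows instead of returning a single data frame.
    for chunk in pd.read_csv(StringIO(csv_text), chunksize=chunksize):
        total += chunk[column].sum()
    return total


# Hypothetical stand-in for the text returned by the blob service.
csv_text = "id,value\n1,10\n2,20\n3,30\n"
total = sum_column_in_chunks(csv_text, "value")

# Drop the reference to the raw text once it is no longer needed.
del csv_text
```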

I've done more or less the same thing with Azure Data Lake Storage Gen2 (which is built on Azure Blob Storage).

Hope it helps.
