
Load CSV stored as an Azure Blob directly into a Pandas data frame without saving to disk first

The article Explore data in Azure blob storage with pandas (here) shows how to load data from an Azure Blob Store into a Pandas data frame.

They do it by first downloading the blob and storing it locally as a CSV file, then loading that CSV file into a data frame.

import pandas as pd
from azure.storage.blob import BlockBlobService

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
dataframe_blobdata = pd.read_csv(LOCALFILENAME)

Is there a way to pull the blob directly into a data frame without saving it to local disk first?

You could try something like this (using StringIO):

import pandas as pd
from azure.storage.blob import BlockBlobService
from io import StringIO

blob_service = BlockBlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
blob_string = blob_service.get_blob_to_text(CONTAINERNAME, BLOBNAME)
dataframe_blobdata = pd.read_csv(StringIO(blob_string.content))

Be aware that the blob contents will be held entirely in memory, so a large file can cause a MemoryError. Once the data is in the data frame, you can del the blob_string to free that memory.
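To see the in-memory step in isolation, here is a minimal sketch that parses a CSV string with StringIO exactly as in the answer above, but with hypothetical sample data in place of the text returned by the blob service, so no Azure connection is needed:

```python
from io import StringIO

import pandas as pd

# Hypothetical CSV text standing in for the string returned by
# get_blob_to_text(...).content.
blob_string = "id,name,score\n1,alice,0.9\n2,bob,0.7\n"

# read_csv parses the in-memory text just as it would a file on disk.
dataframe_blobdata = pd.read_csv(StringIO(blob_string))

# Free the raw text once the data frame holds the parsed copy.
del blob_string
```

After this, `dataframe_blobdata` is a regular two-row data frame with columns id, name, and score.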

I've done more or less the same thing with Azure DataLake Storage Gen2 (which uses Azure Blob Storage).

Hope it helps.

