Read CSV from Azure Blob Storage and store in a DataFrame
I'm trying to read multiple CSV files from blob storage using Python.
The code that I'm using is:
blob_service_client = BlobServiceClient.from_connection_string(connection_str)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs(name_starts_with=folder_root)
for blob in blobs_list:
    blob_client = blob_service_client.get_blob_client(container=container, blob=blob.name)
    stream = blob_client.download_blob().content_as_text()
I'm not sure what the correct way is to store the CSV files I read into a pandas DataFrame.
I tried to use:
df = df.append(pd.read_csv(StringIO(stream)))
But this shows me an error.
Any idea how I can do this?
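For context, the `df.append` pattern in the question is the likely source of the error: `DataFrame.append` was deprecated and removed in pandas 2.0. The usual replacement is to parse each downloaded blob into its own frame and concatenate once at the end. A minimal sketch, with in-memory CSV strings standing in for the downloaded blob contents:

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-ins for downloaded blobs; in the real loop each string
# would come from blob_client.download_blob().content_as_text().
csv_texts = ["a,b\n1,2\n3,4\n", "a,b\n5,6\n"]

# Build a list of frames and concatenate once, instead of calling
# the removed DataFrame.append inside the loop.
frames = [pd.read_csv(StringIO(text)) for text in csv_texts]
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (3, 2)
```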
You could download the file from blob storage, then read the data into a pandas DataFrame from the downloaded file.
from azure.storage.blob import BlockBlobService
import pandas as pd
import time

STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
t1=time.time()
blob_service=BlockBlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
t2=time.time()
print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))

# LOCALFILENAME is the local file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
For more details, see here.
If you want to do the conversion directly, this code will help. You need to get the content from the blob object, and with get_blob_to_text there's no need for a local file name.
from io import StringIO

blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
import pandas as pd
data = pd.read_csv('blob_sas_url')
The Blob SAS URL can be found by right-clicking the blob file you want to import in the Azure portal and selecting Generate SAS. Then click the Generate SAS token and URL button and copy the SAS URL into the code above in place of blob_sas_url.
You can now read directly from Blob Storage into a pandas DataFrame (this uses the fsspec abfs:// protocol, which requires the adlfs package to be installed):
import os
import pandas as pd

mydata = pd.read_csv(
    f"abfs://{blob_path}",
    storage_options={
        "connection_string": os.environ["STORAGE_CONNECTION"]
    })
where blob_path is the path to your file, given as {container-name}/{blob-prefix.csv}
The BlockBlobService class that was part of azure-storage is deprecated. Use the following instead:
!pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient
import pandas as pd

STORAGEACCOUNTURL= <storage_account_url>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
blob_service_client_instance = BlobServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client_instance = blob_service_client_instance.get_blob_client(CONTAINERNAME, BLOBNAME, snapshot=None)
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)

#import blob to dataframe
df = pd.read_csv(LOCALFILENAME)
LOCALFILENAME is the same as BLOBNAME
BlockBlobService is indeed deprecated. However, @Deepak's answer doesn't work for me. The following works:
import pandas as pd
from io import BytesIO
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING= <connection_string>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(CONTAINERNAME)
blob_client = container_client.get_blob_client(BLOBNAME)

with BytesIO() as input_blob:
    blob_client.download_blob().download_to_stream(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob)