
Read CSV from Azure Blob Storage and store in a DataFrame

I'm trying to read multiple CSV files from blob storage using Python.

The code that I'm using is:

from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connection_str)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs(folder_root)
for blob in blobs_list:
    blob_client = blob_service_client.get_blob_client(container=container, blob=blob.name)
    stream = blob_client.download_blob().content_as_text()

I'm not sure of the correct way to store the CSV files I've read in a pandas DataFrame.

I tried to use:

df = df.append(pd.read_csv(StringIO(stream)))

But this shows me an error.

Any idea how I can do this?
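As a general pattern (a sketch, independent of the answers below): parse each downloaded CSV text into its own DataFrame and concatenate once at the end with pd.concat. Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, which may be the source of the error above depending on your pandas version. The CSV texts here are stand-ins for what the loop's download_blob().content_as_text() call returns:

```python
import pandas as pd
from io import StringIO

# Stand-in CSV texts; in practice each string comes from
# blob_client.download_blob().content_as_text() inside the loop
csv_texts = [
    "a,b\n1,2\n",
    "a,b\n3,4\n",
]

# One DataFrame per blob, then a single concat at the end
frames = [pd.read_csv(StringIO(text)) for text in csv_texts]
df = pd.concat(frames, ignore_index=True)
```

Building a list and concatenating once is also much faster than appending inside the loop, since each append/concat copies all rows seen so far.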

You could download the file from blob storage, then read the data into a pandas DataFrame from the downloaded file.

from azure.storage.blob import BlockBlobService
import pandas as pd
import time

STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
t1=time.time()
blob_service=BlockBlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
t2=time.time()
print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))

# LOCALFILE is the file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)

For more details, see here.


If you want to do the conversion directly, this code will help. You need to get the content from the blob object, and with get_blob_to_text there's no need for a local file name.

from io import StringIO
blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
import pandas as pd
data = pd.read_csv('blob_sas_url')

The Blob SAS URL can be found by right-clicking the blob file you want to import in the Azure portal and selecting Generate SAS. Then click the Generate SAS token and URL button and copy the SAS URL into the above code in place of blob_sas_url.

You can now directly read from Blob Storage into a pandas DataFrame:

import os
import pandas as pd

mydata = pd.read_csv(
    f"abfs://{blob_path}",
    storage_options={"connection_string": os.environ["STORAGE_CONNECTION"]},
)

where blob_path is the path to your file, given as {container-name}/{blob-prefix.csv}. pandas resolves the abfs:// protocol through fsspec, so the adlfs package must be installed.
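For clarity, a small sketch of how blob_path is assembled. The container and blob names here are purely hypothetical, and the actual read_csv call is commented out because it needs a real connection string:

```python
# Hypothetical container/blob names, purely for illustration
container_name = "sales-data"
blob_name = "2021/january.csv"

# pandas expects "{container-name}/{blob-prefix.csv}" after the abfs:// scheme
blob_path = f"{container_name}/{blob_name}"

# mydata = pd.read_csv(f"abfs://{blob_path}", storage_options={...})  # needs credentials
```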

BlockBlobService, part of the azure-storage package, is deprecated. Use the following instead:

!pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient
import pandas as pd

STORAGEACCOUNTURL= <storage_account_url>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
blob_service_client_instance=BlobServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client_instance = blob_service_client_instance.get_blob_client(CONTAINERNAME, BLOBNAME, snapshot=None)
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)

#import blob to dataframe
df = pd.read_csv(LOCALFILENAME)

LOCALFILENAME is the same as BLOBNAME

BlockBlobService is indeed deprecated. However, @Deepak's answer doesn't work for me. The following does:

import pandas as pd
from io import BytesIO
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING= <connection_string>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(CONTAINERNAME)
blob_client = container_client.get_blob_client(BLOBNAME)

with BytesIO() as input_blob:
    blob_client.download_blob().download_to_stream(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob)
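As a side note on the stream handling above: download_blob() also exposes readall(), which returns the blob's contents as bytes in one call, so wrapping them in a BytesIO avoids the intermediate download/seek dance. A minimal sketch with stand-in bytes (the Azure call itself is omitted, since it needs credentials):

```python
import pandas as pd
from io import BytesIO

# Stand-in for the bytes returned by blob_client.download_blob().readall();
# assumes the blob holds UTF-8 CSV text
raw = b"a,b\n1,2\n3,4\n"

df = pd.read_csv(BytesIO(raw))
```

This loads the whole blob into memory at once, so the streaming download_to_stream/readinto approach above is still preferable for very large files.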
