Read CSV from Azure Blob Storage and store in a DataFrame
I'm trying to read multiple CSV files from blob storage using Python.
The code that I'm using is:
blob_service_client = BlobServiceClient.from_connection_string(connection_str)
container_client = blob_service_client.get_container_client(container)
blobs_list = container_client.list_blobs(name_starts_with=folder_root)
for blob in blobs_list:
    blob_client = blob_service_client.get_blob_client(container=container, blob=blob.name)
    stream = blob_client.download_blob().content_as_text()
I'm not sure what the correct way is to store the CSV files I read into a pandas DataFrame.
I tried to use:
df = df.append(pd.read_csv(StringIO(stream)))
But this shows me an error.
Any idea how I can do this?
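For context, the `df.append` pattern in the question is the likely source of the error: `DataFrame.append` was deprecated and removed in pandas 2.0. The usual replacement is to parse each downloaded blob into its own frame and concatenate once at the end. A minimal sketch, with in-memory CSV strings standing in for the downloaded blob contents:

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-ins for downloaded blobs; in the real loop each string
# would come from blob_client.download_blob().content_as_text().
csv_texts = ["a,b\n1,2\n3,4\n", "a,b\n5,6\n"]

# Build a list of frames and concatenate once, instead of calling
# the removed DataFrame.append inside the loop.
frames = [pd.read_csv(StringIO(text)) for text in csv_texts]
df = pd.concat(frames, ignore_index=True)
print(df.shape)  # (3, 2)
```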
You could download the file from blob storage, then read the data into a pandas DataFrame from the downloaded file.
from azure.storage.blob import BlockBlobService
import pandas as pd
import time

STORAGEACCOUNTNAME= <storage_account_name>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
t1=time.time()
blob_service=BlockBlobService(account_name=STORAGEACCOUNTNAME,account_key=STORAGEACCOUNTKEY)
blob_service.get_blob_to_path(CONTAINERNAME,BLOBNAME,LOCALFILENAME)
t2=time.time()
print(("It takes %s seconds to download "+BLOBNAME) % (t2 - t1))

# LOCALFILENAME is the local file path
dataframe_blobdata = pd.read_csv(LOCALFILENAME)
For more details, see here.
If you want to do the conversion directly, this code will help. You need to get the content from the blob object, and with get_blob_to_text there's no need for a local file name.
from io import StringIO

blobstring = blob_service.get_blob_to_text(CONTAINERNAME,BLOBNAME).content
df = pd.read_csv(StringIO(blobstring))
import pandas as pd
data = pd.read_csv('blob_sas_url')
The Blob SAS URL can be found by right-clicking the blob file you want to import in the Azure portal and selecting Generate SAS. Then click the Generate SAS token and URL button and copy the SAS URL into the code above in place of blob_sas_url.
You can now read directly from Blob Storage into a pandas DataFrame (this uses the fsspec abfs:// protocol, which requires the adlfs package to be installed):
import os
import pandas as pd

mydata = pd.read_csv(
    f"abfs://{blob_path}",
    storage_options={
        "connection_string": os.environ["STORAGE_CONNECTION"]
    })
where blob_path is the path to your file, given as {container-name}/{blob-prefix.csv}
The BlockBlobService class that was part of azure-storage is deprecated. Use the following instead:
!pip install azure-storage-blob
from azure.storage.blob import BlobServiceClient
import pandas as pd

STORAGEACCOUNTURL= <storage_account_url>
STORAGEACCOUNTKEY= <storage_account_key>
LOCALFILENAME= <local_file_name>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

#download from blob
blob_service_client_instance = BlobServiceClient(account_url=STORAGEACCOUNTURL, credential=STORAGEACCOUNTKEY)
blob_client_instance = blob_service_client_instance.get_blob_client(CONTAINERNAME, BLOBNAME, snapshot=None)
with open(LOCALFILENAME, "wb") as my_blob:
    blob_data = blob_client_instance.download_blob()
    blob_data.readinto(my_blob)

#import blob to dataframe
df = pd.read_csv(LOCALFILENAME)
LOCALFILENAME is the same as BLOBNAME
BlockBlobService is indeed deprecated. However, @Deepak's answer doesn't work for me. The following works:
import pandas as pd
from io import BytesIO
from azure.storage.blob import BlobServiceClient

CONNECTION_STRING= <connection_string>
CONTAINERNAME= <container_name>
BLOBNAME= <blob_name>

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(CONTAINERNAME)
blob_client = container_client.get_blob_client(BLOBNAME)

with BytesIO() as input_blob:
    blob_client.download_blob().download_to_stream(input_blob)
    input_blob.seek(0)
    df = pd.read_csv(input_blob)