
Read xlsx from azure blob storage to pandas dataframe without creating temporary file

I am trying to read an xlsx file from Azure blob storage into a pandas dataframe without creating a temporary local file. I have seen many similar questions, e.g. Issues Reading Azure Blob CSV Into Python Pandas DF, but haven't managed to get the proposed solutions to work.

The code snippet below raises the exception UnicodeDecodeError: 'utf-8' codec can't decode byte 0x87 in position 14: invalid start byte.

from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient

blob_client = BlobClient.from_blob_url(blob_url = url + container + "/" + blobname, credential = token)   
blob = blob_client.download_blob().content_as_text()   
df = pd.read_excel(StringIO(blob))

Using a temporary file, I did manage to get it working with the following snippet:

blob_service_client = BlobServiceClient(account_url = url, credential = token)
blob_client = blob_service_client.get_blob_client(container=container, blob=blobname)

with open(tmpfile, "wb") as my_blob:
    download_stream = blob_client.download_blob()
    my_blob.write(download_stream.readall())

data = pd.read_excel(tmpfile)

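Similar to what you have already done, we can use download_blob() to load a StorageStreamDownloader object into memory, then content_as_text() to decode the content into a string.

We can then read the content from a CSV StringIO buffer into a pandas DataFrame with pandas.read_csv():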

from io import StringIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient
import os

connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="blobs", blob="test.csv")

blob = blob_client.download_blob().content_as_text()

df = pd.read_csv(StringIO(blob))
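
To make explicit what the snippet above relies on, here is a minimal sketch using only the standard library. The text path works because the blob content is first decoded to a str, and StringIO then turns that string into a file-like object a CSV parser can read. The helper name `text_buffer` and the sample bytes are hypothetical, and `csv.reader` stands in for `pandas.read_csv()` so the example runs without pandas or Azure access.

```python
import csv
from io import StringIO


def text_buffer(raw: bytes, encoding: str = "utf-8") -> StringIO:
    """Decode raw bytes and wrap the resulting text in a file-like buffer."""
    return StringIO(raw.decode(encoding))


# Hypothetical CSV payload, standing in for the bytes of a downloaded blob.
sample = b"name,value\nalpha,1\nbeta,2\n"

# Any parser that accepts a file-like object can consume the buffer.
rows = list(csv.reader(text_buffer(sample)))
print(rows)
```

This only works for content that really is text in the given encoding, which is the case for CSV but not for XLSX.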

Update

If we are working with XLSX files, use content_as_bytes() to return bytes instead of a string, and convert to a pandas dataframe with pandas.read_excel():

from io import BytesIO
import pandas as pd
from azure.storage.blob import BlobClient, BlobServiceClient
import os

connection_string = os.getenv('AZURE_STORAGE_CONNECTION_STRING')

blob_service_client = BlobServiceClient.from_connection_string(connection_string)

blob_client = blob_service_client.get_blob_client(container="blobs", blob="test.xlsx")

# XLSX is a binary format, so download raw bytes rather than decoded text
blob = blob_client.download_blob().content_as_bytes()

# Wrap the bytes in a file-like buffer; recent pandas versions deprecate passing raw bytes
df = pd.read_excel(BytesIO(blob))
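
As a hedged variation on the snippet above, the download step can be wrapped in a small helper that returns a BytesIO buffer. This sketch uses readall(), which newer versions of azure-storage-blob document as the replacement for content_as_bytes(). The helper name `blob_to_buffer` and the two `Fake*` classes are hypothetical: the fakes only mimic the download_blob() and readall() calls so the example runs without an Azure account.

```python
from io import BytesIO


def blob_to_buffer(blob_client) -> BytesIO:
    """Download a blob fully into memory and return it as a seekable buffer.

    `blob_client` is expected to behave like azure.storage.blob.BlobClient:
    download_blob() must return an object that has a readall() method.
    """
    data = blob_client.download_blob().readall()
    return BytesIO(data)


# Stand-ins for the Azure objects (hypothetical, for illustration only).
class FakeDownloader:
    def __init__(self, payload: bytes):
        self._payload = payload

    def readall(self) -> bytes:
        return self._payload


class FakeBlobClient:
    def __init__(self, payload: bytes):
        self._payload = payload

    def download_blob(self) -> FakeDownloader:
        return FakeDownloader(self._payload)


buffer = blob_to_buffer(FakeBlobClient(b"PK\x03\x04 example bytes"))
print(buffer.getvalue()[:4])
```

With a real BlobClient, the call would presumably look like `df = pd.read_excel(blob_to_buffer(blob_client))`. Note that the whole blob is held in memory, which is fine for typical spreadsheets but may not suit very large files.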

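Since content_as_text() uses UTF-8 encoding by default, this is probably what causes the UnicodeDecodeError when decoding the bytes.

If we set the encoding to None, the content is returned without being decoded, so we can still use it with pandas.read_excel():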
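
The failure can be reproduced without Azure. An .xlsx file is a ZIP archive, so its bytes are binary data rather than UTF-8 text. The sketch below uses synthetic bytes (not taken from a real file) laid out like the start of a ZIP local file header; offset 14 is where the CRC-32 field begins, which would be consistent with the position reported in the question.

```python
# Synthetic bytes shaped like the start of a ZIP local file header:
# a 4-byte signature, 10 bytes of zeroed header fields, then the first
# byte of the CRC-32 field. The value 0x87 is chosen for illustration.
header = b"PK\x03\x04" + bytes(10) + b"\x87"

error = None
try:
    header.decode("utf-8")
except UnicodeDecodeError as exc:
    error = exc
    print(f"{exc.reason} at position {exc.start}: byte {header[exc.start]:#04x}")
```

The signature and the zero bytes decode fine, but 0x87 cannot start a UTF-8 sequence, so decoding stops there. Treating the download as bytes avoids the decode step entirely.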

blob = blob_client.download_blob().content_as_text(encoding=None)

df = pd.read_excel(blob)

