How to read from Azure Blob Storage with Python delta-rs
I want to read from my blob storage using the Python bindings for delta-rs.
At the moment I'm a bit lost, because I don't know how to configure the filesystem on my local machine. Where do I have to put the credentials?
Can I use adlfs for this?
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(
    account_name="...",
    account_key="...",
)
and then use the fs object?
Unfortunately, we don't have great documentation for this at the moment. You should be able to set the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_SAS environment variables as in this integration test.
That will make sure the Python bindings can access the table metadata, but the data for queries is usually fetched through Pandas, and I'm not sure whether Pandas will pick up these variables on its own.
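A minimal sketch of that setup, assuming the two environment variable names above and an adls2-style table URI (the account, container, and table names here are made up, and the exact URI scheme may differ between delta-rs versions):

```python
import os

# Hypothetical values -- replace with your real storage account and SAS token.
account = "myaccount"
os.environ["AZURE_STORAGE_ACCOUNT"] = account
os.environ["AZURE_STORAGE_SAS"] = "sv=2021-08-06&sig=placeholder"  # placeholder, not a real token

# delta-rs can then be pointed at the table with an adls2-style URI
# (container "mycontainer" and table folder "events" are assumptions here).
table_uri = f"adls2://{account}/mycontainer/events"

# from deltalake import DeltaTable
# dt = DeltaTable(table_uri)  # needs real credentials to actually run
print(table_uri)
```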
One possible workaround is to download the delta lake files to a tmp-dir and read them with python-delta-rs, like this:
import os
import tempfile

from azure.storage.blob import BlobServiceClient
from deltalake import DeltaTable


def get_blobs_for_folder(container_client, blob_storage_folder_path):
    blob_iter = container_client.list_blobs(name_starts_with=blob_storage_folder_path)
    blob_names = []
    for blob in blob_iter:
        # To just get files and not directories; there might be a better way to do this.
        if "." in blob.name:
            blob_names.append(blob.name)
    return blob_names


def download_blob_files(container_client, blob_names, local_folder):
    for blob_name in blob_names:
        local_filename = os.path.join(local_folder, blob_name)
        local_file_dir = os.path.dirname(local_filename)
        if not os.path.exists(local_file_dir):
            os.makedirs(local_file_dir)
        with open(local_filename, "wb") as f:
            f.write(container_client.download_blob(blob_name).readall())


def read_delta_lake_file_to_df(blob_storage_path, access_key):
    blob_storage_url = "https://your-blob-storage"
    blob_service_client = BlobServiceClient(blob_storage_url, credential=access_key)
    container_client = blob_service_client.get_container_client("your-container-name")
    blob_names = get_blobs_for_folder(container_client, blob_storage_path)

    with tempfile.TemporaryDirectory() as tmp_dirpath:
        download_blob_files(container_client, blob_names, tmp_dirpath)
        local_filename = os.path.join(tmp_dirpath, blob_storage_path)
        dt = DeltaTable(local_filename)
        df = dt.to_pyarrow_table().to_pandas()

    return df
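As a side note, download_blob_files mirrors the blob's folder structure locally by recreating each blob name's directory prefix before writing. That path handling can be sketched in isolation with the standard library (the blob name below is hypothetical):

```python
import os
import tempfile

blob_name = "my-table/part-00000.snappy.parquet"  # hypothetical blob name

local_folder = tempfile.mkdtemp()
local_filename = os.path.join(local_folder, blob_name)
local_file_dir = os.path.dirname(local_filename)  # .../my-table
os.makedirs(local_file_dir, exist_ok=True)
with open(local_filename, "wb") as f:
    f.write(b"")  # stand-in for container_client.download_blob(...).readall()
print(os.path.exists(local_filename))  # True
```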
I don't know about delta-rs, but you can use this object directly with pandas.
import pandas as pd
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(
    account_name="account_name",
    account_key="access_key",
    container_name="name_of_container",
)
df = pd.read_parquet("path/of/file/with/container_name/included", filesystem=abfs)