
How to read from Azure Blob Storage with Python delta-rs

I'd like to use the Python bindings to delta-rs to read from my blob storage.

Currently I am kind of lost, since I cannot figure out how to configure the filesystem on my local machine. Where do I have to put my credentials?

Can I use adlfs for this?

from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(
    account_name="...",
    account_key="..."
)

and then use the fs object?

Unfortunately we don't have great documentation around this at the moment. You should be able to set the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_SAS environment variables as in this integration test.

That will ensure the Python bindings can access the table metadata, but fetching the data for a query is typically done through Pandas, and I'm not sure whether Pandas will handle these variables as well (not an ADLSv2 user myself).
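A minimal sketch of that environment-variable approach, assuming placeholder values for the account name and SAS token; the adls2:// URI scheme and the container/table path below are assumptions, not something confirmed above:

import os
from deltalake import DeltaTable

# Placeholder credentials; substitute your own storage account and SAS token.
os.environ["AZURE_STORAGE_ACCOUNT"] = "mystorageaccount"
os.environ["AZURE_STORAGE_SAS"] = "?sv=...&sig=..."

# delta-rs should pick up these variables when resolving the table URI;
# the URI scheme and path here are hypothetical placeholders.
dt = DeltaTable("adls2://mystorageaccount/my-container/path/to/table")
print(dt.files())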

One possible workaround is to download the Delta Lake files to a temporary directory and read them with python-delta-rs, using something like this:

import os
import tempfile

from azure.storage.blob import BlobServiceClient
from deltalake import DeltaTable

def get_blobs_for_folder(container_client, blob_storage_folder_path):
    blob_iter = container_client.list_blobs(name_starts_with=blob_storage_folder_path)
    blob_names = []
    for blob in blob_iter:
        if "." in blob.name:
            # To just get files and not directories, there might be a better way to do this
            blob_names.append(blob.name)

    return blob_names


def download_blob_files(container_client, blob_names, local_folder):
    for blob_name in blob_names:
        local_filename = os.path.join(local_folder, blob_name)
        local_file_dir = os.path.dirname(local_filename)
        if not os.path.exists(local_file_dir):
            os.makedirs(local_file_dir)

        with open(local_filename, 'wb') as f:
            f.write(container_client.download_blob(blob_name).readall())


def read_delta_lake_file_to_df(blob_storage_path, access_key):
    blob_storage_url = "https://your-blob-storage"
    blob_service_client = BlobServiceClient(blob_storage_url, credential=access_key)
    container_client = blob_service_client.get_container_client("your-container-name")

    blob_names = get_blobs_for_folder(container_client, blob_storage_path)
    with tempfile.TemporaryDirectory() as tmp_dirpath:
        download_blob_files(container_client, blob_names, tmp_dirpath)
        local_filename = os.path.join(tmp_dirpath, blob_storage_path)
        dt = DeltaTable(local_filename)
        df = dt.to_pyarrow_table().to_pandas()
    return df
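A usage sketch for the helper above; the folder path and access key are placeholders:

# Hypothetical call: "my-table/" is whatever folder inside the container holds the Delta table.
df = read_delta_lake_file_to_df("my-table/", access_key="your-access-key")
print(df.head())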


I don't know about delta-rs but you can use this object directly with pandas.

import pandas as pd
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(account_name="account_name", account_key="access_key", container_name="name_of_container")
df = pd.read_parquet("path/of/file/with/container_name/included", filesystem=abfs)
