How to read from Azure Blob Storage with Python delta-rs
I want to read from my blob storage using the Python bindings to delta-rs.
I'm a bit lost at the moment, because I can't figure out how to configure the filesystem on my local machine. Where do I put the credentials?
Can I use adlfs for this?
from adlfs import AzureBlobFileSystem

fs = AzureBlobFileSystem(
    account_name="...",
    account_key="..."
)
and then use the fs object?
Unfortunately, we don't have great documentation for this yet. You should be able to set the AZURE_STORAGE_ACCOUNT and AZURE_STORAGE_SAS environment variables, as done in this integration test.
That will make sure the Python bindings can access the table metadata, but fetching the data for a query usually goes through Pandas, and I'm not sure Pandas will handle those variables on its own.
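For illustration, a minimal sketch of setting those variables before opening a table. The account name, SAS token, and the abfss:// URI below are placeholders, and the URI scheme is an assumption about how delta-rs resolves Azure paths:

```python
import os

# Placeholder credentials; delta-rs reads these from the environment
os.environ["AZURE_STORAGE_ACCOUNT"] = "mystorageaccount"
os.environ["AZURE_STORAGE_SAS"] = "sv=2020-08-04&sig=placeholder"  # SAS token

# With the variables set, the table can then be opened by URI, e.g.:
# from deltalake import DeltaTable
# dt = DeltaTable("abfss://container/path/to/delta-table")
```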
One possible workaround is to download the Delta Lake files to a temporary directory and read them with python-delta-rs, like this:
import os
import tempfile

from azure.storage.blob import BlobServiceClient
from deltalake import DeltaTable


def get_blobs_for_folder(container_client, blob_storage_folder_path):
    blob_iter = container_client.list_blobs(name_starts_with=blob_storage_folder_path)
    blob_names = []
    for blob in blob_iter:
        # To just get files and not directories; there might be a better way to do this
        if "." in blob.name:
            blob_names.append(blob.name)
    return blob_names


def download_blob_files(container_client, blob_names, local_folder):
    for blob_name in blob_names:
        local_filename = os.path.join(local_folder, blob_name)
        local_file_dir = os.path.dirname(local_filename)
        if not os.path.exists(local_file_dir):
            os.makedirs(local_file_dir)
        with open(local_filename, "wb") as f:
            f.write(container_client.download_blob(blob_name).readall())


def read_delta_lake_file_to_df(blob_storage_path, access_key):
    blob_storage_url = "https://your-blob-storage"
    blob_service_client = BlobServiceClient(blob_storage_url, credential=access_key)
    container_client = blob_service_client.get_container_client("your-container-name")

    blob_names = get_blobs_for_folder(container_client, blob_storage_path)
    with tempfile.TemporaryDirectory() as tmp_dirpath:
        download_blob_files(container_client, blob_names, tmp_dirpath)
        local_filename = os.path.join(tmp_dirpath, blob_storage_path)
        dt = DeltaTable(local_filename)
        df = dt.to_pyarrow_table().to_pandas()
    return df
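The download helper above mirrors the blob hierarchy on the local disk before DeltaTable reads it. That path logic can be exercised on its own, without an Azure connection; the blob names below are made up for illustration:

```python
import os
import tempfile

# Hypothetical blob names, shaped like list_blobs() results for a Delta table
blob_names = [
    "my-table/_delta_log/00000000000000000000.json",
    "my-table/part-00000-abc.snappy.parquet",
]

with tempfile.TemporaryDirectory() as tmp_dirpath:
    for blob_name in blob_names:
        local_filename = os.path.join(tmp_dirpath, blob_name)
        # Recreate the blob's folder structure locally
        os.makedirs(os.path.dirname(local_filename), exist_ok=True)
        with open(local_filename, "wb") as f:
            f.write(b"")  # container_client.download_blob(...).readall() would go here
    # The folder prefix is what gets handed to DeltaTable
    local_table_path = os.path.join(tmp_dirpath, "my-table")
    created = sorted(os.listdir(local_table_path))

print(created)  # the mirrored _delta_log directory and data file
```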
I don't know delta-rs, but you can use this object directly with pandas.

import pandas as pd
from adlfs import AzureBlobFileSystem

abfs = AzureBlobFileSystem(
    account_name="account_name",
    account_key="access_key",
    container_name="name_of_container",
)
df = pd.read_parquet("path/of/file/with/container_name/included", filesystem=abfs)