Read multiple files in parallel from an Azure blob
I'm looking to read a bunch of small files from an Azure blob; this can be on the order of 1k-100k files summing up to a few TB in total. I have to process these files in Python. The processing itself is not heavy, but reading the files from the blob does take time. One other constraint is that new files are being written while I process the first ones.
I'm looking for options to do this. Is it possible to use Dask to read several files from the blob in parallel? Or is it possible to transfer and load over 1 TB per hour inside the Azure network?
Well, you have a couple of options to achieve parallelism here:
The snippet below uses the ThreadPool class in Python to download and process files in parallel from Azure Storage. Note: it uses the v12 storage SDK.
from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient

STORAGE_CONNECTION_STRING = "REPLACE_THIS"
BLOB_CONTAINER = "myfiles"

class AzureBlobProcessor:
    def __init__(self):
        # Initialize the service and container clients
        self.blob_service_client = BlobServiceClient.from_connection_string(STORAGE_CONNECTION_STRING)
        self.blob_container = self.blob_service_client.get_container_client(BLOB_CONTAINER)

    def process_all_blobs_in_container(self):
        # Get a list of blobs in the container
        blobs = self.blob_container.list_blobs()
        return self.execute(blobs)

    def execute(self, blobs):
        # Just a sample thread count of 10
        with ThreadPool(processes=10) as pool:
            return pool.map(self.download_and_process_blob, blobs)

    def download_and_process_blob(self, blob):
        file_name = blob.name
        # Below is just a sample that reads all bytes; update to the variant you need
        data = self.blob_container.get_blob_client(blob).download_blob().readall()
        # processing logic goes here :)
        return file_name

# caller code
azure_blob_processor = AzureBlobProcessor()
azure_blob_processor.process_all_blobs_in_container()
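Since new blobs keep arriving while the first batch is being processed, one simple approach is to poll list_blobs() in a loop and hand only unseen names to the pool. A minimal sketch of the bookkeeping (select_new_blobs is a hypothetical helper, not part of the SDK; the listed names stand in for list_blobs() results):

```python
def select_new_blobs(listed_names, seen):
    """Return blob names not yet processed, and record them as seen."""
    new = [name for name in listed_names if name not in seen]
    seen.update(new)
    return new

# Example: the first listing pass sees two blobs; by the second pass
# one more blob has been written to the container.
seen = set()
first = select_new_blobs(["a.csv", "b.csv"], seen)           # both are new
second = select_new_blobs(["a.csv", "b.csv", "c.csv"], seen)  # only c.csv is new
```

You would call this between polling passes and submit only the returned names to the thread pool, so already-processed files are not downloaded twice.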
You can also look at Dask's remote data reading. Check https://github.com/dask/adlfs
To use the Gen1 filesystem:
import dask.dataframe as dd
storage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}
dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)
To use the Gen2 filesystem you can use the protocol abfs or az:
import dask.dataframe as dd
storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)
To read from a public storage blob you are required to specify the 'account_name'. For example, you can access the NYC Taxi & Limousine Commission data as:
storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)
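If you would rather not pull in Dask, the same fan-out pattern works with the standard library's concurrent.futures. A self-contained sketch of the idea (local temp files stand in for blobs here, since no storage account is assumed; with real blobs the worker would call download_blob() instead of open()):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_one(path):
    # Stand-in for downloading and processing one blob;
    # returns the file name and its size in bytes.
    with open(path, "rb") as f:
        data = f.read()
    return os.path.basename(path), len(data)

# Create a few local files as stand-ins for small blobs.
tmp = tempfile.mkdtemp()
paths = []
for i in range(5):
    p = os.path.join(tmp, f"part{i}.bin")
    with open(p, "wb") as f:
        f.write(b"x" * (i + 1))
    paths.append(p)

# Fan the reads out over a thread pool; threads work well here
# because the per-file work is I/O-bound, as in the blob case.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(read_one, paths))
```

Threads (rather than processes) are usually the right choice for this workload, since each task spends most of its time waiting on the network.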
Well, you have multiple options along this path.
Lastly, I suggest a deep dive into the Performance and scalability checklist for Blob storage to ensure you stay within the limits of Azure storage account data transfer. Also see Scalability and performance targets for standard storage accounts. Looking at your 1 TB per hour requirement, it seems under the limit if you convert it to Gbps and compare against the figures in that doc.
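As a sanity check on that conversion (decimal units assumed, i.e. 1 TB = 10^12 bytes, as in the Azure docs):

```python
# Convert the 1 TB/hour requirement into gigabits per second,
# to compare against the published storage-account throughput targets.
TB = 10**12                          # bytes in a decimal terabyte
bits_per_hour = 1 * TB * 8           # 1 TB/hour expressed in bits
gbps = bits_per_hour / 3600 / 10**9  # gigabits per second
print(round(gbps, 2))                # ≈ 2.22 Gbps
```

So 1 TB/hour works out to roughly 2.2 Gbps of sustained egress, well under the default throughput targets listed for standard storage accounts.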