
Read multiple files in parallel from an Azure blob

I'm looking to read a bunch of small files from an Azure blob container; this can be on the order of 1k-100k files, summing up to a few TB in total. I have to process these files in Python. The processing itself is not heavy, but reading the files from the blob does take time. One other constraint is that new files are being written while I process the first ones.

I'm looking for options to do this. Is it possible to use dask to read several files from the blob in parallel? Or is it possible to transfer and load over 1 TB per hour inside the Azure network?

Well, you have a couple of options to achieve parallelism here:

Multi-threading:

The code below uses the ThreadPool class in Python to download and process files in parallel from Azure storage. Note: it uses the v12 storage SDK.

from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient

STORAGE_CONNECTION_STRING = "REPLACE_THIS"
BLOB_CONTAINER = "myfiles"

class AzureBlobProcessor:
    def __init__(self):
        # Initialize the service and container clients
        self.blob_service_client = BlobServiceClient.from_connection_string(STORAGE_CONNECTION_STRING)
        self.blob_container = self.blob_service_client.get_container_client(BLOB_CONTAINER)

    def process_all_blobs_in_container(self):
        # List every blob in the container and hand the list to the thread pool
        blobs = self.blob_container.list_blobs()
        return self.execute(blobs)

    def execute(self, blobs):
        # 10 worker threads is just a sample value; tune it for your workload
        with ThreadPool(processes=10) as pool:
            return pool.map(self.download_and_process_blob, blobs)

    def download_and_process_blob(self, blob):
        file_name = blob.name

        # Sample that reads the whole blob into memory; adapt to the variant you need
        data = self.blob_container.get_blob_client(blob).download_blob().readall()

        # processing logic goes here :)

        return file_name

# caller code
azure_blob_processor = AzureBlobProcessor()
azure_blob_processor.process_all_blobs_in_container()
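
A note on the design choice: downloading blobs is I/O-bound, so a thread pool works well even with Python's GIL, since the lock is released while threads wait on network I/O. The pool size of 10 is only a sample value; you would typically benchmark a few sizes against your own blob sizes and network throughput.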

You can also look at dask's remote data reading. Check https://github.com/dask/adlfs

To use the Gen1 filesystem:

import dask.dataframe as dd

storage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}

dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)

To use the Gen2 filesystem you can use the protocol abfs or az:

import dask.dataframe as dd

storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}

ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)

To read from a public storage blob you are required to specify the 'account_name'. For example, you can access the NYC Taxi & Limousine Commission data as:

storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)

Achieve parallelism leveraging Azure PaaS:

Well, you have multiple options down this path.
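
One possibility (not spelled out above, so treat it as an illustrative assumption) is an Azure Functions blob trigger: each new blob invokes a function, and the platform scales instances out for you, which also covers the case of files arriving while earlier ones are still being processed. A minimal sketch using the Python v2 programming model might look like this; the container name myfiles and the connection setting AzureWebJobsStorage are placeholders:

import logging
import azure.functions as func

app = func.FunctionApp()

# Fires once per new blob in the "myfiles" container (placeholder name);
# the Functions runtime fans out across instances for parallelism.
@app.blob_trigger(arg_name="myblob",
                  path="myfiles/{name}",
                  connection="AzureWebJobsStorage")
def process_blob(myblob: func.InputStream):
    data = myblob.read()  # blob contents as bytes
    logging.info("Processing %s (%d bytes)", myblob.name, len(data))
    # processing logic goes here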


Lastly, I suggest a deep dive into the Performance and scalability checklist for Blob storage to make sure you stay within the data-transfer limits of an Azure storage account, and also the Scalability and performance targets for standard storage accounts. Looking at your 1 TB per hour requirement, it seems to be under the limit if you convert the Gbps figures from the docs above.
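
For reference, a rough conversion of the requirement (decimal units assumed here):

# 1 TB/hour expressed in Gbps (using 1 TB = 10**12 bytes)
bits_per_hour = 1 * 10**12 * 8       # bits transferred per hour
gbps = bits_per_hour / 3600 / 10**9  # divide by seconds per hour and 10^9 bits per Gb
print(f"{gbps:.2f} Gbps")            # ~2.22 Gbps, well below typical storage account egress targets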
