Read multiple files in parallel from an Azure blob
I'm looking to read a bunch of small files from an Azure blob; this can be on the order of 1k-100k files summing up to a few TB in total. I have to process these files in Python. The processing itself is not heavy, but reading the files from the blob does take time. One other constraint is that new files are being written while I process the first ones.
I'm looking for options to do this. Is it possible to use Dask to read several files from the blob in parallel? Or is it possible to transfer and load over 1 TB per hour inside the Azure network?
Well, you have a couple of options to achieve parallelism here:
The snippet below uses the ThreadPool class in Python to download and process files in parallel from Azure Storage. Note: it uses the v12 storage SDK.
from multiprocessing.pool import ThreadPool
from azure.storage.blob import BlobServiceClient

STORAGE_CONNECTION_STRING = "REPLACE_THIS"
BLOB_CONTAINER = "myfiles"

class AzureBlobProcessor:
    def __init__(self):
        # Initialize the service and container clients
        self.blob_service_client = BlobServiceClient.from_connection_string(STORAGE_CONNECTION_STRING)
        self.blob_container = self.blob_service_client.get_container_client(BLOB_CONTAINER)

    def process_all_blobs_in_container(self):
        # Get a list of blobs in the container
        blobs = self.blob_container.list_blobs()
        return self.execute(blobs)

    def execute(self, blobs):
        # Just a sample thread count of 10
        with ThreadPool(processes=10) as pool:
            return pool.map(self.download_and_process_blob, blobs)

    def download_and_process_blob(self, blob):
        file_name = blob.name
        # Below is just a sample that reads all bytes; update to the variant you need
        data = self.blob_container.get_blob_client(blob).download_blob().readall()
        # processing logic goes here :)
        return file_name

# caller code
azure_blob_processor = AzureBlobProcessor()
azure_blob_processor.process_all_blobs_in_container()
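Since new blobs keep arriving while the first batch is being processed, one simple approach is to poll list_blobs() in a loop and hand only unseen names to the pool. A minimal sketch of the bookkeeping (select_new_blobs is a hypothetical helper, not part of the SDK; the listed names stand in for list_blobs() results):

```python
def select_new_blobs(listed_names, seen):
    """Return blob names not yet processed, and record them as seen."""
    new = [name for name in listed_names if name not in seen]
    seen.update(new)
    return new

# Example: the first listing pass sees two blobs; by the second pass
# one more blob has been written to the container.
seen = set()
first = select_new_blobs(["a.csv", "b.csv"], seen)           # both are new
second = select_new_blobs(["a.csv", "b.csv", "c.csv"], seen)  # only c.csv is new
```

You would call this between polling passes and submit only the returned names to the thread pool, so already-processed files are not downloaded twice.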
You can also look at Dask's remote data reading. Check https://github.com/dask/adlfs
To use the Gen1 filesystem:
import dask.dataframe as dd
storage_options={'tenant_id': TENANT_ID, 'client_id': CLIENT_ID, 'client_secret': CLIENT_SECRET}
dd.read_csv('adl://{STORE_NAME}/{FOLDER}/*.csv', storage_options=storage_options)
To use the Gen2 filesystem you can use the protocol abfs or az:
import dask.dataframe as dd
storage_options={'account_name': ACCOUNT_NAME, 'account_key': ACCOUNT_KEY}
ddf = dd.read_csv('abfs://{CONTAINER}/{FOLDER}/*.csv', storage_options=storage_options)
ddf = dd.read_parquet('az://{CONTAINER}/folder.parquet', storage_options=storage_options)
To read from a public storage blob you are required to specify the 'account_name'. For example, you can access the NYC Taxi & Limousine Commission data as:
storage_options = {'account_name': 'azureopendatastorage'}
ddf = dd.read_parquet('az://nyctlc/green/puYear=2019/puMonth=*/*.parquet', storage_options=storage_options)
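If you would rather not pull in Dask, the same fan-out pattern works with the standard library's concurrent.futures. A self-contained sketch of the idea (local temp files stand in for blobs here, since no storage account is assumed; with real blobs the worker would call download_blob() instead of open()):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_one(path):
    # Stand-in for downloading and processing one blob;
    # returns the file name and its size in bytes.
    with open(path, "rb") as f:
        data = f.read()
    return os.path.basename(path), len(data)

# Create a few local files as stand-ins for small blobs.
tmp = tempfile.mkdtemp()
paths = []
for i in range(5):
    p = os.path.join(tmp, f"part{i}.bin")
    with open(p, "wb") as f:
        f.write(b"x" * (i + 1))
    paths.append(p)

# Fan the reads out over a thread pool; threads work well here
# because the per-file work is I/O-bound, as in the blob case.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = dict(pool.map(read_one, paths))
```

Threads (rather than processes) are usually the right choice for this workload, since each task spends most of its time waiting on the network.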
Well, you have multiple options along this path.
Lastly, I suggest a deep dive into the Performance and scalability checklist for Blob storage to ensure you stay within the limits of Azure storage account data transfer. Also see Scalability and performance targets for standard storage accounts. Looking at your 1 TB per hour requirement, it seems under the limit if you convert it to Gbps and compare against the figures in that doc.
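As a sanity check on that conversion (decimal units assumed, i.e. 1 TB = 10^12 bytes, as in the Azure docs):

```python
# Convert the 1 TB/hour requirement into gigabits per second,
# to compare against the published storage-account throughput targets.
TB = 10**12                          # bytes in a decimal terabyte
bits_per_hour = 1 * TB * 8           # 1 TB/hour expressed in bits
gbps = bits_per_hour / 3600 / 10**9  # gigabits per second
print(round(gbps, 2))                # ≈ 2.22 Gbps
```

So 1 TB/hour works out to roughly 2.2 Gbps of sustained egress, well under the default throughput targets listed for standard storage accounts.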