
Azure Databricks: Python parallel for loop

I am using Azure Databricks to analyze some data. I have the following folder structure in blob storage:

folder_1\n1 csv files
folder_2\n2 csv files
..
folder_k\nk csv files

I want to read these files, run a relatively simple algorithm on each one, and write out log files and image files for each csv file to a similar folder structure at another blob storage location. Right now I have a simple loop structure to do this:

for folder in folders:
  #set up some stuff
  for file in files:
    #do the work and write out results

The database contains 150k files. Is there a way to parallelize this?

The best way I found to parallelize such embarrassingly parallel tasks in Databricks is to use a pandas UDF ( https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html?_ga=2.143957493.1972283838.1643225636-354359200.1607978015 )

I created a Spark DataFrame with the list of files and folders to loop through and passed it to a pandas UDF with a specified number of partitions (essentially the number of cores to parallelize over). This can leverage the available cores on a Databricks cluster. There are a few restrictions on what you can call from a pandas UDF (for example, you cannot use 'dbutils' calls directly), but it worked like a charm for my application.
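A minimal sketch of that approach, not my exact code: `process_one`, `process_batch`, and `run_parallel` are hypothetical names, the body of `process_one` is a placeholder for the real per-file work, and `spark` is assumed to be the session a Databricks notebook already provides.

```python
import pandas as pd


def process_one(folder: str, file: str) -> str:
    """Hypothetical placeholder for the per-file work."""
    # Real code would read the csv, run the algorithm, and write the log and
    # image files. dbutils cannot be called directly here, so use ordinary
    # Python I/O libraries instead.
    return f"done:{folder}/{file}"


def process_batch(folders: pd.Series, files: pd.Series) -> pd.Series:
    """Series-to-Series function; Spark calls it once per batch of rows."""
    results = [process_one(folder, file) for folder, file in zip(folders, files)]
    return pd.Series(results)


def run_parallel(spark, pairs, num_partitions):
    """Spread (folder, file) pairs over num_partitions Spark tasks."""
    # Imported lazily so the functions above can be tested without pyspark.
    from pyspark.sql.functions import pandas_udf

    process_udf = pandas_udf(process_batch, returnType="string")
    df = spark.createDataFrame(pairs, schema=["folder", "file"])
    return (
        df.repartition(num_partitions)
        .withColumn("status", process_udf("folder", "file"))
        .collect()  # an action is required to actually trigger the work
    )
```

In a notebook this could be called as `run_parallel(spark, [("folder_1", "a.csv"), ("folder_2", "b.csv")], 8)`. Choosing the partition count close to the total number of worker cores is the usual starting point.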
