
Azure Databricks: Python parallel for loop

I am using Azure Databricks to analyze some data. I have the following folder structure in blob storage:

folder_1: n1 csv files
folder_2: n2 csv files
...
folder_k: nk csv files

I want to read these files, run some algorithm (relatively simple) and write out some log files and image files for each of the csv files in a similar folder structure at another blob storage location. Right now I have a simple loop structure to do this:

for folder in folders:
  #set up some stuff
  for file in files:
    #do the work and write out results

The blob storage contains 150k files in total. Is there a way to parallelize this?

The best way I found to parallelize such embarrassingly parallel tasks in Databricks is with pandas UDFs (https://databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html).

I created a Spark dataframe with the list of files and folders to loop over, and passed it to a pandas UDF with a specified number of partitions (essentially the number of cores to parallelize over). This lets you use all the available cores on a Databricks cluster. There are a few restrictions on what you can call from inside a pandas UDF (for example, you cannot use dbutils directly), but it worked like a charm for my application.
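
The general shape of that approach looks something like the sketch below. This is not my exact code: the mount paths, the process_one body, and the partition count of 64 are placeholders you would replace with your own.

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf
from pyspark.sql.types import StringType

spark = SparkSession.builder.getOrCreate()

# One row per csv file to process. The paths here are hypothetical;
# in practice, build this list from the mounted blob container.
paths = [f"/mnt/input/folder_{k}/data_{i}.csv"
         for k in range(1, 4) for i in range(1, 3)]
files_df = spark.createDataFrame([(p,) for p in paths], ["path"])

@pandas_udf(StringType())
def process_files(paths: pd.Series) -> pd.Series:
    def process_one(path: str) -> str:
        # Placeholder for the per-file work: read the csv, run the
        # algorithm, write log/image files to the output location.
        # dbutils is not available here, so use plain file APIs
        # against the /dbfs or /mnt paths instead.
        try:
            data = pd.read_csv(path)
            # ... run the algorithm and write out results ...
            return "ok"
        except Exception as exc:
            return f"failed: {exc}"
    return paths.apply(process_one)

# Repartition to the desired degree of parallelism (roughly the total
# number of cores in the cluster), then trigger execution.
statuses = (files_df
            .repartition(64)
            .withColumn("status", process_files(col("path")))
            .collect())

Repartitioning the path dataframe is what controls the parallelism: Spark runs one task per partition, so setting it to roughly the cluster's total core count keeps all workers busy, and returning a status string per file gives you a simple way to spot failures afterwards.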
