
How do I submit multiple Spark jobs in parallel using Python and joblib?

How do I submit multiple Spark jobs in parallel using Python's joblib library?

I also want to do a "save" or "collect" in every job, so I need to reuse the same Spark context across the jobs.

Here is an example that runs multiple independent Spark jobs in parallel, without waiting for the first one to finish.

Other Approaches

  • Don't use multiprocessing: the Spark context can't be pickled, so it can't be handed to worker processes.
  • Don't launch the jobs from Spark UDFs: code running inside a UDF can't access the Spark context.

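One caveat: Spark's FAIR scheduler must be enabled (spark.scheduler.mode set to FAIR). Under the default FIFO scheduler, a large job at the head of the queue can delay the jobs submitted after it.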
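A minimal sketch of how the scheduler mode could be set when the session is built. spark.scheduler.mode is Spark's own configuration key; the SCHEDULER_CONF dict and the apply_conf helper are hypothetical names introduced here for illustration. The sketch assumes you create the session yourself, since the setting needs to be in place before the Spark context starts.

```python
# Settings to apply before the session is created.
# "spark.scheduler.mode" is a Spark configuration key; FIFO is the default.
SCHEDULER_CONF = {"spark.scheduler.mode": "FAIR"}


def apply_conf(builder, conf):
    """Apply each key/value pair to a SparkSession-style builder (hypothetical helper)."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder


# Usage, assuming pyspark is installed and no session is running yet:
#   from pyspark.sql import SparkSession
#   spark = apply_conf(SparkSession.builder, SCHEDULER_CONF).getOrCreate()
```

If the session is handed to you by a notebook or a managed platform, the same key would more likely have to go into the cluster or job configuration instead.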

This solution uses threads rather than separate processes, so that:

  1. The Spark context can be shared between the threads.
  2. Local variables can be shared between the threads.
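The two points above can be illustrated without Spark at all. The sketch below is only an analogy: FakeContext is a hypothetical stand-in for an object that cannot be pickled, and the standard library's ThreadPoolExecutor stands in for joblib's thread backend. It shows that pickling the object fails, while threads simply share it.

```python
import pickle
import threading
from concurrent.futures import ThreadPoolExecutor


class FakeContext:
    """Hypothetical stand-in for a shared context; the lock makes it unpicklable."""

    def __init__(self):
        self._lock = threading.Lock()
        self.calls = 0

    def run_job(self, n):
        # Pretend to run a job; the lock guards the shared counter.
        with self._lock:
            self.calls += 1
        return sum(range(n))


ctx = FakeContext()

# Worker processes would need a pickled copy of ctx, and pickling fails.
try:
    pickle.dumps(ctx)
    picklable = True
except TypeError:
    picklable = False

# Threads use the very same object, so nothing has to be pickled.
with ThreadPoolExecutor(max_workers=4) as pool:
    totals = list(pool.map(ctx.run_job, [10, 20, 30]))

print(picklable, totals, ctx.calls)
```

All three calls update the same ctx.calls counter, which is the kind of sharing the thread-based approach relies on.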

The code below calculates the mean of means of some numbers:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean
from pyspark.sql.types import LongType
from joblib import Parallel, delayed
import pandas as pd
import random

# Reuse the running session if there is one (for example in a notebook).
# FAIR scheduling is assumed to be configured already.
spark = SparkSession.builder.getOrCreate()

lst = list(range(10, 100))

# Operates on a whole Column: the random factor is drawn once per call,
# not once per row.
def multiply(a):
  return a * random.randint(10, 100)

def foo(i):
  # This is the key point: any Spark action (collect/save/show) can be run here.
  # Independent jobs like this one are what running in parallel speeds up.
  df = spark.createDataFrame(range(0, i), LongType()).select(mean(multiply(col("value"))).alias("value"))
  # toPandas() is an action, so the Spark job is submitted from this thread.
  # Returning the lazy DataFrame instead would defer all work to the main thread.
  return df.toPandas()

parallel_job_count = 10
# Use "threads" so that the same spark object can be reused between the jobs.
results = Parallel(n_jobs=parallel_job_count, prefer="threads")(delayed(foo)(i) for i in lst)

# Combine the per-job results and print the mean of means
mean_of_means = pd.concat(results).value.mean()
print(f"Mean of Means: {mean_of_means}")
