How do I submit multiple Spark jobs in parallel using Python and joblib?
I also want to run a save or collect inside each job, so I need to reuse the same Spark context across the jobs.
Here is an example that runs multiple independent Spark jobs in parallel, without waiting for the first one to finish.
Other approaches
One caveat is that Spark's FAIR scheduler must be enabled (spark.scheduler.mode=FAIR).
This solution uses threads rather than separate processes, so that the same Spark session object can be reused across the jobs.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, mean
from pyspark.sql.types import LongType
from joblib import Parallel, delayed
import pandas as pd
import random

# FAIR scheduling lets the concurrent jobs share executors instead of running FIFO.
spark = (SparkSession.builder
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

lst = list(range(10, 100))

# Multiply a column by a random constant; passing a Column works because
# Column overloads the * operator, so no UDF is needed.
def multiply(a):
    return a * random.randint(10, 100)

def foo(i):
    # This is the key point: many different Spark collect/save/show actions can be run here.
    # Parallelizing this function is what speeds up multiple independent jobs.
    return spark.createDataFrame(range(0, i), LongType()).select(mean(multiply(col("value"))).alias("value"))

parallel_job_count = 10

# Use "threads" so the same spark object is reused between the jobs.
results = Parallel(n_jobs=parallel_job_count, prefer="threads")(delayed(foo)(i) for i in lst)

# Collect and print the results
mean_of_means = pd.concat([result.toPandas() for result in results]).value.mean()
print(f"Mean of Means: {mean_of_means}")