
Pyspark multiple jobs in parallel

I have the following situation with my Pyspark:

In my driver program (driver.py), I call a function from another file (prod.py):

latest_prods = prod.featurize_prods(). 

Driver code:

from pyspark import SparkContext

from Featurize import Featurize
from LatestProd import LatestProd
from Oldprod import Oldprod

sc = SparkContext()

if __name__ == '__main__':
    print('Into main')

    # featurize the latest products
    featurize_latest = Featurize('param1', 'param2', sc)
    latest_prod = LatestProd(featurize_latest)
    latest_prods = latest_prod.featurize_prods()

    # featurize the old products
    featurize_old = Featurize('param3', 'param3', sc)
    old_prod = Oldprod(featurize_old)
    old_prods = old_prod.featurize_oldprods()

    # combine both RDDs
    total_prods = sc.union([latest_prods, old_prods])

Then I do some reduceByKey code here... that generates total_prods_processed.

Finally I call:

total_prods_processed.saveAsTextFile(...)

I would like to generate latest_prods and old_prods in parallel. Both are created in the same SparkContext. Is it possible to do that? If not, how can I achieve that functionality?

Is this something that Spark does automatically? I am not seeing this behavior when I run the code, so please let me know if it is a configuration option.

After searching on the internet, I think your problem can be addressed by threads. It is as simple as creating two threads for your old_prod and latest_prod work.

Check this post for a simplified example. Since Spark is thread-safe, you gain the parallel efficiency without sacrificing anything.
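
For concreteness, a minimal sketch of that idea using the objects from the question. The cache()/count() calls are my addition: only an action actually submits a Spark job, so each thread has to trigger one before the union.

import threading

results = {}

def materialize(name, make_rdd):
    rdd = make_rdd()      # build the RDD lazily
    rdd.cache()
    rdd.count()           # an action is needed to actually launch the job
    results[name] = rdd

t1 = threading.Thread(target=materialize,
                      args=('latest', LatestProd(featurize_latest).featurize_prods))
t2 = threading.Thread(target=materialize,
                      args=('old', Oldprod(featurize_old).featurize_oldprods))
t1.start(); t2.start()
t1.join(); t2.join()

total_prods = sc.union([results['latest'], results['old']])

Both jobs are submitted to the same SparkContext, so they can run concurrently as long as the cluster has free resources.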

The short answer is no, you can't schedule operations on two distinct RDDs at the same time in the same Spark context. However, there are some workarounds: you could process them in two distinct SparkContexts on the same cluster and call saveAsTextFile in each, then read both in another job to perform the union (this is not recommended by the documentation). If you want to try this method, it is discussed here using spark-jobserver, since Spark doesn't support multiple contexts by default: https://github.com/spark-jobserver/spark-jobserver/issues/147
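
A rough sketch of the final step of that workaround, assuming the two independent applications have already written their results with saveAsTextFile (the paths below are placeholders):

from pyspark import SparkContext

sc = SparkContext(appName='union_job')

# hypothetical output paths written by the two separate applications
latest_prods = sc.textFile('hdfs:///tmp/latest_prods')
old_prods = sc.textFile('hdfs:///tmp/old_prods')

# note: saveAsTextFile writes plain strings, so key/value records would
# need to be parsed back into tuples before the reduceByKey step
total_prods = latest_prods.union(old_prods)
total_prods.saveAsTextFile('hdfs:///tmp/total_prods')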

However, given the operations you perform, there would be no reason to process both at the same time: since you need the full results to perform the union, Spark will split those operations into two different stages that will be executed one after the other.
