
How are tasks distributed within a Spark cluster?

So I have an input that consists of a dataset and several ML algorithms (with parameter tuning) using scikit-learn. I have made quite a few attempts at executing this as efficiently as possible, but at this moment I still don't have the proper infrastructure to assess my results. However, I lack some background in this area and I need help to get things cleared up.

Basically, I want to know how the tasks are distributed in a way that exploits all the available resources as much as possible, and what is actually done implicitly (for instance, by Spark) and what isn't.

This is my scenario:

[diagram of my scenario]

I need to train many different Decision Tree models (as many as there are combinations of all the possible parameters), many different Random Forest models, and so on...

In one of my approaches, I have a list and each of its elements corresponds to one ML algorithm and its list of parameters.

spark.parallelize(algorithms).map(lambda algorithm: run_experiment(dataframe, algorithm))

In this run_experiment function I create a GridSearchCV for the corresponding ML algorithm with its parameter grid. I also set n_jobs=-1 in order to (try to) achieve maximum parallelism.
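
To make this concrete, here is roughly what run_experiment does (a simplified sketch, not my exact code: the (estimator, param_grid) layout of each list element and the way the data is split into X/y are assumptions for illustration):

from sklearn.model_selection import GridSearchCV

def run_experiment(dataframe, algorithm):
    # Each list element is assumed to be an (estimator, param_grid) pair,
    # e.g. (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}).
    estimator, param_grid = algorithm
    X = dataframe.drop("label", axis=1)
    y = dataframe["label"]
    # n_jobs=-1 only uses all the cores of the single worker this Spark task
    # was scheduled on; it does not spread the grid search to other nodes.
    search = GridSearchCV(estimator, param_grid, cv=3, n_jobs=-1)
    search.fit(X, y)
    return type(estimator).__name__, search.best_params_, search.best_score_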

In this context, on my Spark cluster with a few nodes, does it make sense that the execution would look somewhat like this?

[diagram of the expected execution across the cluster nodes]

Or could there be one Decision Tree model and also one Random Forest model running on the same node? This is my first experience using a cluster environment, so I am a bit confused about how to expect things to work.

On the other hand, what exactly changes in terms of execution if, instead of the first approach with parallelize, I use a for loop to sequentially iterate through my list of algorithms and create the GridSearchCV using Databricks' spark-sklearn integration between Spark and scikit-learn? The way it's illustrated in the documentation, it seems to be something like this:

[diagram from the spark-sklearn documentation]
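
For concreteness, this is the kind of loop I picture (a sketch assuming spark_sklearn's drop-in GridSearchCV, which takes the SparkContext as its first argument, and an existing SparkContext sc; the toy iris data is only for illustration):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from spark_sklearn import GridSearchCV  # Databricks' spark-sklearn package

X, y = load_iris(return_X_y=True)  # toy data, just for illustration

algorithms = [
    (DecisionTreeClassifier(), {"max_depth": [3, 5, 10]}),
    (RandomForestClassifier(), {"n_estimators": [10, 50], "max_depth": [3, 5]}),
]

# The loop itself runs sequentially on the driver; it is each fit() call that
# fans the parameter combinations out over the cluster.
for estimator, param_grid in algorithms:
    search = GridSearchCV(sc, estimator, param_grid, cv=3)
    search.fit(X, y)
    print(type(estimator).__name__, search.best_params_)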

Finally, regarding this second approach, if I use the same ML algorithms but with Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?
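
For reference, this is what I have in mind with MLlib (a sketch using pyspark.ml's CrossValidator; train_df stands for my training DataFrame with "features" and "label" columns, which is assumed to exist):

from pyspark.ml.classification import DecisionTreeClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

dt = DecisionTreeClassifier(labelCol="label", featuresCol="features")

# Every combination in the grid becomes a candidate model; because the data is
# a Spark DataFrame, the training of each candidate is itself distributed.
grid = (ParamGridBuilder()
        .addGrid(dt.maxDepth, [3, 5, 10])
        .addGrid(dt.minInstancesPerNode, [1, 5])
        .build())

cv = CrossValidator(estimator=dt,
                    estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(metricName="accuracy"),
                    numFolds=3)

best_model = cv.fit(train_df).bestModel  # train_df: my training DataFrame (assumed)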

Sorry if most of this is a bit naive, but I really appreciate any answers or insights on this. I wanted to understand the basics before actually testing in the cluster and playing with task scheduling parameters.


I am not sure whether this question is more suitable here or on CS StackExchange.

spark.parallelize(algorithms).map(...)

From the ref: "The elements of the collection are copied to form a distributed dataset that can be operated on in parallel." That means that your algorithms are going to be scattered among your nodes. From there, every algorithm will execute.
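
A tiny sketch of that scattering (assuming an existing SparkContext sc):

# Four driver-side elements become a 4-partition RDD; glom() shows which
# elements ended up together in each partition, i.e. in each task.
rdd = sc.parallelize(["decision_tree", "random_forest", "gbt", "svm"], 4)
print(rdd.getNumPartitions())  # 4
print(rdd.glom().collect())    # e.g. [['decision_tree'], ['random_forest'], ['gbt'], ['svm']]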

Your scheme could be valid if the algorithms and their respective parameters were scattered that way, which I think is the case for you.

About using all your resources: Spark is very good at this. However, you need to check that the workload is balanced among your tasks (every task should do the same amount of work) in order to get good performance.
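
A toy way to see why the balance matters (the sleeps stand in for grid searches of very different sizes; assumes an existing SparkContext sc):

import time

# Pretend each number is the cost, in seconds, of one algorithm's grid search.
fake_workloads = [1, 1, 1, 8]

def fake_experiment(seconds):
    time.sleep(seconds)
    return seconds

# One element per partition, so each "algorithm" becomes its own task.
results = sc.parallelize(fake_workloads, len(fake_workloads)).map(fake_experiment).collect()
# The stage takes about 8 seconds: three tasks finish after ~1 second and their
# executors then sit idle while the heaviest task holds the whole stage up.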


What changes if, instead of the first approach with parallelize, I use a for loop?

Everything. Your dataset (algorithms in your case) is not an RDD, thus no parallel execution occurs.

... and also using Databricks' spark-sklearn integration between Spark and scikit-learn?

This article describes how Random Forests are implemented there:

"The scikit-learn package for Spark provides an alternative implementation of the cross-validation algorithm that distributes the workload on a Spark cluster. Each node runs the training algorithm using a local copy of the scikit-learn library, and reports the best model back to the master." “Spark的scikit-learn包提供了交叉验证算法的替代实现,该算法在Spark集群上分配工作负载。每个节点使用scikit-learn库的本地副本运行训练算法,并报告最佳模型对主人说。“

We can generalize this to all your algorithms, which makes your scheme reasonable.


Spark MLlib instead of scikit-learn, would the whole parallelization/distribution be taken care of?

Yes, it would. The idea of both of these libraries is to take care of things for us, so that our lives are easier.


I would advise you to ask one big question at a time, since the answer is too broad now, but I will try to be laconic.
