
Designate a specific machine with Apache Spark

I'm totally new to Spark and Hadoop-type stuff in general, so forgive me if this is a painfully basic question. I'm trying to design a system that uses a cluster of some number of machines to do the first tasks in a series of tasks. The follow-up tasks, which run on the RDDs that the first tasks generate, must all be done on the same machine. This could be any machine in the cluster, as long as it's always that same machine for the duration of the program run.

How do I make sure that happens? Can I reserve a single machine in the cluster and always run the follow-up tasks on that machine? If so, how does that look in Java? If not, is there some other way to accomplish this?

In general, no. Spark, like Hadoop, is designed to distribute tasks more or less arbitrarily over the available nodes, and it assumes all available nodes are equivalent for its purposes. None of them is treated specially.

If you don't want the second half of the process to run in a (more or less) massively parallel fashion, you probably don't want to use a parallel processing framework for that half of the job. Maybe you should write all the data from the parallel calculations to disk somewhere, and then run the second half of the job on that data, not as Spark RDD transformations but as normal Scala code that reads the files and processes them. It's hard to say.
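Since the question asks about Java, here is a rough sketch of what that second half could look like as a plain Java program rather than Scala. It assumes the Spark stage saved its results as text with `saveAsTextFile`, which writes a directory of `part-NNNNN` files, and that the directory is readable as an ordinary file system path from the one machine you pick (a local disk or shared mount, not HDFS directly). The names `FollowUpStage` and `readPartFiles` are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical single-machine follow-up stage; no Spark dependency at all.
class FollowUpStage {

    // Reads every "part-*" file in outputDir in file-name order and returns
    // all lines. Marker files such as _SUCCESS do not match the glob.
    static List<String> readPartFiles(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // part-00000, part-00001, ...

        List<String> lines = new ArrayList<>();
        for (Path part : parts) {
            lines.addAll(Files.readAllLines(part, StandardCharsets.UTF_8));
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FollowUpStage <spark-output-dir>");
            return;
        }
        List<String> records = readPartFiles(Paths.get(args[0]));
        // The sequential follow-up tasks would go here, all on this machine.
        System.out.println("Read " + records.size() + " records");
    }
}
```

You would launch this with plain `java` on whichever machine you choose, after the Spark job has finished. It loads everything into memory, so it only fits output small enough for one machine; for larger output you would process each part file as a stream instead.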

Why do all the "follow-up tasks" need to run in one particular place? If you can explain more about this need, maybe someone can make good suggestions for you.
