
Could FAIR scheduling mode make Spark Streaming jobs that read from different topics run in parallel?

I use Spark 2.1 and Kafka 0.9.

Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources. This means that short jobs submitted while a long job is running can start receiving resources right away and still get good response times, without waiting for the long job to finish.

According to this, if I have multiple jobs from multiple threads in the case of Spark Streaming (one topic per thread), is it possible for multiple topics to run simultaneously given enough cores in my cluster, or would it just round-robin across pools but run only one job at a time?

Context:

I have two topics, T1 and T2, each with one partition. I have configured a pool with scheduling mode set to FAIR. I have 4 cores registered with Spark. Each topic has two actions (hence two jobs, so 4 jobs in total across topics). Let's say J1 and J2 are jobs for T1, and J3 and J4 are jobs for T2. What Spark does in FAIR mode is execute J1, J3, J2, J4, but at any time only one job is executing. Since each topic has only one partition, only one core is being used and the other 3 are idle. This is something I don't want. Is there any way I can avoid this?
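For reference, a minimal sketch of the setup described above, assuming the spark-streaming-kafka-0-8 direct stream API (which works against 0.9 brokers); the broker address, batch interval and the concrete output operations are illustrative:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("two-topics").setMaster("local[4]") // 4 cores, as described
val ssc  = new StreamingContext(conf, Seconds(10))

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")

// One direct stream per topic; each topic has a single partition
val t1 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("T1"))
val t2 = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("T2"))

// Two output operations per stream => two jobs per topic per batch (J1..J4)
t1.map(_._2).print()                    // J1
t1.map(_._2).saveAsTextFiles("out/t1")  // J2
t2.map(_._2).print()                    // J3
t2.map(_._2).saveAsTextFiles("out/t2")  // J4

ssc.start()
ssc.awaitTermination()
```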

if I have multiple jobs from multiple threads... is it possible that multiple topics can run simultaneously

Yes. That's the purpose of FAIR scheduling mode.

As you may have noticed, I removed "Spark Streaming" from your question since it does not contribute in any way to how Spark schedules Spark jobs. It does not really matter whether you start your Spark jobs from a "regular" application or a Spark Streaming one.

Quoting Scheduling Within an Application (highlighting mine):

Inside a given Spark application (SparkContext instance), multiple parallel jobs can run simultaneously if they were submitted from separate threads.

By default, Spark's scheduler runs jobs in FIFO fashion. Each job is divided into "stages" (e.g. map and reduce phases), and the first job gets priority on all available resources while its stages have tasks to launch, then the second job gets priority, etc.

And then comes the quote you used in your question, which should now be clearer.

it is also possible to configure fair sharing between jobs. Under fair sharing, Spark assigns tasks between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of cluster resources.
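To make the quoted behaviour concrete, here is a minimal non-streaming sketch of two jobs submitted from separate threads of the same SparkContext; the pool names and the dummy computation are made up for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf()
  .setAppName("fair-demo")
  .setMaster("local[4]")
  .set("spark.scheduler.mode", "FAIR")) // enable fair sharing between jobs

// One job per thread; under FAIR the two jobs share the 4 cores instead of
// queueing behind each other as they would under FIFO.
val threads = Seq("pool-t1", "pool-t2").map { pool =>
  new Thread(new Runnable {
    override def run(): Unit = {
      sc.setLocalProperty("spark.scheduler.pool", pool) // pool names are illustrative
      sc.parallelize(1 to 1000000, 4).map(_ * 2).count()
    }
  })
}
threads.foreach(_.start())
threads.foreach(_.join())
```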

So, speaking about Spark Streaming, you'd have to configure FAIR scheduling mode, and Spark Streaming's JobScheduler should then submit Spark jobs per topic in parallel (I haven't tested it out myself, so it's more theory than practice).
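A sketch of what that configuration could look like when building the StreamingContext; the allocation file path is an assumption, and per-pool settings (schedulingMode, weight, minShare) would live in that XML file:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("fair-streaming")
  .set("spark.scheduler.mode", "FAIR")
  // Optional: pool definitions come from an XML allocation file;
  // the path below is an assumption for this sketch.
  .set("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")

val ssc = new StreamingContext(conf, Seconds(10))
```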

I think that the fair scheduler alone will not help, as it's the Spark Streaming engine that takes care of submitting the Spark jobs, and it normally does so sequentially.

There's an undocumented configuration parameter in Spark Streaming: spark.streaming.concurrentJobs [1], which is set to 1 by default. It controls the parallelism level of jobs submitted to Spark.

By increasing this value, you may see parallel processing of the different Spark stages of your streaming job.
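For example, a sketch that allows two streaming jobs to run at the same time (hedged, since the key is undocumented and its behaviour may change):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("concurrent-streaming")
  .set("spark.scheduler.mode", "FAIR")         // fair sharing between the concurrent jobs
  .set("spark.streaming.concurrentJobs", "2")  // undocumented: allow 2 streaming jobs to run in parallel
```

With 4 cores and one partition per topic, this should let a job for T1 and a job for T2 each occupy a core at the same time.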

I would think that by combining this configuration with the fair scheduler in Spark, you will be able to achieve controlled parallel processing of the independent topic consumers. This is mostly uncharted territory.
