

Concurrent Spark stream job from one Kafka topic source

We have a simple Spark stream from a Kafka topic (with 8 partitions), created as shown below and submitted with 2 executors (4 cores each).

dataSet
   .writeStream()
   .trigger(Trigger.ProcessingTime(0))
   .format("kafka")
   .start();
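
For context, here is a minimal, self-contained sketch of the full pipeline, assuming placeholder values for the broker address, topic names and checkpoint directory (the original snippet only shows the write side):

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.Trigger;

    public class SingleStreamJob {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                .appName("single-stream-job")
                .getOrCreate();

            // Source: subscribe to the whole topic, so all 8 partitions feed one query.
            Dataset<Row> dataSet = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
                .option("subscribe", "topic-name")                   // placeholder topic
                .load();

            // Sink as in the snippet above; Trigger.ProcessingTime(0) starts the next
            // micro-batch as soon as the previous one finishes.
            dataSet
                .writeStream()
                .trigger(Trigger.ProcessingTime(0))
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
                .option("topic", "output-topic")                     // placeholder
                .option("checkpointLocation", "/tmp/checkpoint")     // placeholder
                .start()
                .awaitTermination();
        }
    }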

Now consider this scenario:

  1. One request arrives at partition #0 of this topic.
  2. A Spark job starts with 8 tasks, and only one of them is actually running (the others succeed immediately).
  3. Suppose it takes 1 minute to process this request.
  4. During this 1 minute, 100 requests arrive at this topic (across all 8 partitions).
  5. Spark waits for the current job to finish and only then creates another job to process the new requests.

Our expectation is that Spark would process the other requests in another job while it is still processing the first one, but that's not happening. Now suppose the first job takes 1 hour instead of 1 minute, while the other requests wait to be processed and 7 cores sit idle. That's our problem.

I already tried submitting this job multiple times (e.g. 4 times) from 4 different threads, but the behavior is still the same. I also tried setting the config spark.streaming.concurrentJobs to more than 1, but nothing changed!

So my question is: is it possible to have multiple concurrent jobs for one Kafka stream dataset at all? And if yes, how?

We are using Spark 2, Kafka 1 and Java 8.

So after days of studying and testing, I finally figured out that neither the concurrentJobs setting nor sending jobs from different threads is a solution.

The only working solution is to create a separate stream for each topic partition (or group of partitions).

The unit of parallelism in Kafka is the partition, and Spark (with the Kafka source) can read from specific partition(s) only. So if our topic has 4 partitions, I split my Spark job into 4 different jobs, each listening to (assigned to) one partition, but all of them sinking into the same target.

So now if one job is busy with a time-consuming process, the other jobs (here, 3) can still process data from their assigned partitions; they don't need to wait for processing to finish on the other partitions.

The config looks like this:

assign: {"topic-name":[0,1,2]}

instead of

subscribe: "topic-name"

Pay attention to the config structure: it must be valid JSON, and the partitions must be listed as comma-separated values (ranges and exclusions are not supported).
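
To illustrate the approach, here is a minimal Java sketch, assuming placeholder values for the broker address, topic names and checkpoint directories: it starts one streaming query per partition via the assign option, all writing to the same target.

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.streaming.Trigger;

    public class PerPartitionStreams {
        public static void main(String[] args) throws Exception {
            SparkSession spark = SparkSession.builder()
                .appName("per-partition-streams")
                .getOrCreate();

            int numPartitions = 8; // placeholder: number of partitions in the topic

            for (int p = 0; p < numPartitions; p++) {
                // Each query is assigned exactly one partition of the topic.
                Dataset<Row> partitionStream = spark.readStream()
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092") // placeholder
                    .option("assign", "{\"topic-name\":[" + p + "]}")    // one partition per query
                    .load();

                // All queries sink into the same target topic, but each one
                // needs its own checkpoint location.
                partitionStream
                    .writeStream()
                    .trigger(Trigger.ProcessingTime(0))
                    .format("kafka")
                    .option("kafka.bootstrap.servers", "localhost:9092")        // placeholder
                    .option("topic", "output-topic")                            // placeholder
                    .option("checkpointLocation", "/tmp/checkpoints/part-" + p) // placeholder
                    .start();
            }

            // Block until any of the queries terminates.
            spark.streams().awaitAnyTermination();
        }
    }

The queries run independently within the same SparkSession, so a slow micro-batch on one partition no longer blocks processing of the others; the price is one checkpoint directory per query.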
