
spark-streaming-kafka-0-10: How to limit number of Spark partitions

Is it possible to configure Spark with the spark-streaming-kafka-0-10 library to read multiple Kafka partitions or an entire Kafka topic with a single task instead of creating a different Spark task for every Kafka partition available?

Please excuse my rough understanding of these technologies; I'm still new to Spark and Kafka. The architecture and settings are mostly just me messing around to explore and see how these technologies work together.

I have four virtual hosts, one with a Spark master and each with a Spark worker. One of the hosts is also running a Kafka broker, based on Spotify's Docker image. Each host has four cores and about 8 GB of unused RAM.

The Kafka broker has 206 topics, and each topic has 10 partitions. So there are a total of 2,060 partitions for applications to read from.

I'm using the spark-streaming-kafka-0-10 library (currently experimental) to subscribe to topics in Kafka from a Spark Streaming job. I am using the SubscribePattern class to subscribe to all 206 topics from Spark:

import java.util.regex.Pattern
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.SubscribePattern

// Subscribe to every topic whose name matches "pid.<number>"
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  SubscribePattern[String, String](Pattern.compile("(pid\\.)\\d+"), kafkaParams)
)

When I submit this job to the Spark master, it looks like 16 executors are started, one for each core in the cluster. It also looks like each Kafka partition gets its own task, for a total of 2,060 tasks. I think my cluster of 16 executors is having trouble churning through so many tasks because the job keeps failing at different points between 1,500 and 1,800 tasks completed.

I found a tutorial by Michael Noll from 2014 which addresses using the spark-streaming-kafka-0-8 library to control the number of consumer threads for each topic:

val kafkaParams: Map[String, String] = Map("group.id" -> "terran", ...)

// Receiver-based API: three consumer threads will read the "zerg.hydra" topic
// within this single input DStream's receiver.
val consumerThreadsPerInputDstream = 3
val topics = Map("zerg.hydra" -> consumerThreadsPerInputDstream)
val stream = KafkaUtils.createStream(ssc, kafkaParams, topics, ...)

Is it possible to configure Spark with the spark-streaming-kafka-0-10 library to read multiple Kafka partitions or an entire Kafka topic with a single task instead of creating a different Spark task for every Kafka partition available?

You could alter the number of generated partitions by calling repartition on the stream, but then you lose the 1:1 correspondence between Kafka and RDD partitions.
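
For example, a rough sketch (the target partition count of 16 is only illustrative) that maps the records to key/value pairs and then coalesces them before the heavier processing:

// repartition shuffles the data and breaks the Kafka-to-RDD partition mapping,
// so any offset-range bookkeeping must happen before this step.
val repartitioned = stream
  .map(record => (record.key, record.value))
  .repartition(16)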

The number of tasks generated by the Kafka partitions isn't related to the fact that you have 16 executors. The number of executors depends on your settings and the resource manager you're using.

There is a 1:1 mapping between Kafka partitions and RDD partitions with the direct streaming API; each executor will get a subset of these partitions to consume from Kafka and process, and each partition is independent and can be computed on its own. This is unlike the receiver-based API, which creates a single receiver on an arbitrary executor and consumes the data itself via threads on that node.
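
One way to observe that 1:1 mapping (a sketch following the standard HasOffsetRanges pattern from the 0-10 integration docs) is to print the offset range that backs each RDD partition:

import org.apache.spark.TaskContext
import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

stream.foreachRDD { rdd =>
  // With the direct API, each RDD partition is backed by exactly one Kafka topic-partition.
  val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  rdd.foreachPartition { _ =>
    val o = offsetRanges(TaskContext.get.partitionId)
    println(s"${o.topic} ${o.partition} offsets ${o.fromOffset} -> ${o.untilOffset}")
  }
}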

If you have 206 topics with 10 partitions each, you'd better have a decently sized cluster which can handle the load of the generated tasks. You can control the max messages generated per partition, but you can't alter the number of partitions unless you're willing to incur the shuffling effect of the repartition transformation.
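
For reference, the per-partition rate cap is set through Spark configuration rather than the consumer strategy; a sketch with illustrative values:

import org.apache.spark.SparkConf

// Cap each Kafka partition at 100 records per second of batch time and let
// backpressure adjust the rate dynamically (app name and values are only examples).
val conf = new SparkConf()
  .setAppName("kafka-direct-example")
  .set("spark.streaming.kafka.maxRatePerPartition", "100")
  .set("spark.streaming.backpressure.enabled", "true")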

The second approach will be the best for your requirements. You only have to set consumerThreadsPerInputDstream = 1, so only one thread will be created per read operation, and hence a single machine will be involved per cluster.
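
A minimal sketch of that change, using the receiver-based 0-8 API from the tutorial above ("terran" and "zerg.hydra" are the tutorial's placeholders, and the ZooKeeper address is assumed):

import org.apache.spark.streaming.kafka.KafkaUtils

// One consumer thread for the topic: all of its partitions are read by a single
// receiver thread on whichever executor hosts the receiver.
val topics = Map("zerg.hydra" -> 1)
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "terran", topics)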
