Spark Kafka streaming doesn't distribute consumer load on worker nodes

I've created the following application, which prints the occurrences of specific messages within 20-second windows:

import static org.apache.kafka.clients.consumer.ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG;
import static org.apache.kafka.clients.consumer.ConsumerConfig.GROUP_ID_CONFIG;
import static org.apache.kafka.clients.consumer.ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG;
import static org.apache.kafka.clients.consumer.ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG;

import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka010.ConsumerStrategies;
import org.apache.spark.streaming.kafka010.KafkaUtils;
import org.apache.spark.streaming.kafka010.LocationStrategies;

public class SparkMain {

    public static void main(String[] args) {
        Map<String, Object> kafkaParams = new HashMap<>();

        kafkaParams.put(BOOTSTRAP_SERVERS_CONFIG, "localhost:9092, localhost:9093");
        kafkaParams.put(GROUP_ID_CONFIG, "spark-consumer-id");
        kafkaParams.put(KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        kafkaParams.put(VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // the 'events' topic has 2 partitions
        Collection<String> topics = Arrays.asList("events");

        // local[*]: run Spark locally with as many worker threads as logical cores on the machine
        SparkConf conf = new SparkConf().setMaster("local[*]").setAppName("SsvpSparkStreaming");

        // create a context with a 1 second batch interval
        JavaStreamingContext streamingContext =
                new JavaStreamingContext(conf, Durations.seconds(1));

        JavaInputDStream<ConsumerRecord<String, String>> stream =
                KafkaUtils.createDirectStream(
                        streamingContext,
                        LocationStrategies.PreferConsistent(),
                        ConsumerStrategies.<String, String>Subscribe(topics, kafkaParams)
                );

        // extract the event name from the record value
        stream.map(new Function<ConsumerRecord<String, String>, String>() {
            @Override
            public String call(ConsumerRecord<String, String> rec) throws Exception {
                return rec.value().substring(0, 5);
            }})
        // keep only the events of interest
        .filter(new Function<String, Boolean>() {
            @Override
            public Boolean call(String eventName) throws Exception {
                return eventName.contains("msg");
            }})
        // count with a 20 sec window and a 5 sec slide duration
        .countByValueAndWindow(Durations.seconds(20), Durations.seconds(5))
        .print();

        // windowed operations need a checkpoint directory
        streamingContext.checkpoint("c:\\projects\\spark\\");
        streamingContext.start();
        try {
            streamingContext.awaitTermination();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}

After running the main method, in the logs I see only a single consumer initialization, and it gets both partitions:

2018-10-25 18:25:56,007 INFO [org.apache.kafka.common.utils.LogContext$KafkaLogger.info] - <[Consumer clientId=consumer-1, groupId=spark-consumer-id] Setting newly assigned partitions [events-0, events-1]>

Shouldn't the number of consumers be equal to the number of Spark workers? According to https://spark.apache.org/docs/2.3.2/submitting-applications.html#master-urls:

local[*] means: Run Spark locally with as many worker threads as logical cores on your machine.

I have an 8-core CPU, so I expect 8 consumers, or at least 2, to be created, each getting one partition of the 'events' topic (which has 2 partitions).

It seems to me that I need to run a full standalone Spark master-worker cluster with 2 nodes, where each node starts its own consumer...

You don't necessarily need separate workers or a cluster manager.

Sounds like you're looking to use 2 Spark executors:

How to set amount of Spark executors?
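
As a rough illustration (not part of the original answer), one way to end up with 2 executors on a small standalone cluster is to cap the total cores for the application and the cores per executor; spark://master-host:7077 below is a hypothetical master URL, and the rest of the job would be built exactly as in the question:

import org.apache.spark.SparkConf;

public class TwoExecutorsConfSketch {
    public static void main(String[] args) {
        // Hypothetical standalone master URL -- replace with the real one.
        // On a standalone cluster the executor count per application is roughly
        // spark.cores.max / spark.executor.cores, so 2 cores total at 1 core per
        // executor should yield 2 executors.
        SparkConf conf = new SparkConf()
                .setMaster("spark://master-host:7077")
                .setAppName("SsvpSparkStreaming")
                .set("spark.cores.max", "2")       // cap total cores for this application
                .set("spark.executor.cores", "1"); // one core per executor
        // ... build the JavaStreamingContext with this conf as in the question ...
    }
}

The same properties can also be passed on the command line (spark-submit --conf spark.cores.max=2 --conf spark.executor.cores=1). With LocationStrategies.PreferConsistent, the two Kafka partitions should then be spread evenly across the two executors, each running its own consumer.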
