
Kafka topic partition and Spark executor mapping

I am using Spark Streaming with a Kafka topic. The topic is created with 5 partitions. All of my messages are published to the Kafka topic using the table name as the key. Given this, I assume all messages for a given table should go to the same partition. But I notice in the Spark logs that messages for the same table sometimes go to the executor on node-1 and sometimes to the executor on node-2.
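
For context, a minimal producer-side sketch of the keying scheme described above (this code is not from the original post; the broker address, topic name, and "ORDERS" key are assumed placeholders). With the default partitioner, all records that share the same key hash to the same partition:

import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TableKeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "broker1:9092"); // assumed broker address
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        // Both records use the table name "ORDERS" as the key, so the default
        // partitioner sends them to the same partition of the topic.
        producer.send(new ProducerRecord<>("topic-name", "ORDERS", "{\"id\":1}"));
        producer.send(new ProducerRecord<>("topic-name", "ORDERS", "{\"id\":2}"));
        producer.close();
    }
}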

I am running the code in yarn-cluster mode using the following command:

spark-submit --name DataProcessor --master yarn-cluster --files /opt/ETL_JAR/executor-log4j-spark.xml,/opt/ETL_JAR/driver-log4j-spark.xml,/opt/ETL_JAR/application.properties --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=driver-log4j-spark.xml" --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=executor-log4j-spark.xml" --class com.test.DataProcessor /opt/ETL_JAR/etl-all-1.0.jar

This submission creates 1 driver, let's say on node-1, and 2 executors, on node-1 and node-2.

I don't want the node-1 and node-2 executors to read the same partition, but this is happening.

I also tried the following configuration to specify the consumer group, but it made no difference.

kafkaParams.put("group.id", "app1");

This is how we are creating the stream, using the createDirectStream method (not through ZooKeeper):

    HashMap<String, String> kafkaParams = new HashMap<String, String>();
    kafkaParams.put("metadata.broker.list", brokers);
    kafkaParams.put("auto.offset.reset", "largest");
    kafkaParams.put("group.id", "app1");

        JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
                jssc, 
                String.class, 
                String.class,
                StringDecoder.class, 
                StringDecoder.class, 
                kafkaParams, 
                topicsSet
        );

Complete code:

import java.io.Serializable;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Iterator;

import org.apache.commons.lang3.StringUtils;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.api.java.JavaStreamingContextFactory;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import kafka.serializer.StringDecoder;
import scala.Tuple2;

public class DataProcessor2 implements Serializable {
    private static final long serialVersionUID = 3071125481526170241L;

    private static Logger log = LoggerFactory.getLogger("DataProcessor");

    public static void main(String[] args) {
        final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);
        DataProcessorContextFactory3 factory = new DataProcessorContextFactory3();
        JavaStreamingContext jssc = JavaStreamingContext.getOrCreate(sparkCheckPointDir, factory);

        // Start the process
        jssc.start();
        jssc.awaitTermination();
    }

}

class DataProcessorContextFactory3 implements JavaStreamingContextFactory, Serializable {
    private static final long serialVersionUID = 6070911284191531450L;

    private static Logger logger = LoggerFactory.getLogger(DataProcessorContextFactory3.class);

    DataProcessorContextFactory3() {
    }

    @Override
    public JavaStreamingContext create() {
        logger.debug("creating new context..!");

        final String brokers = ApplicationProperties.getProperty(Consts.KAFKA_BROKERS_NAME);
        final String topic = ApplicationProperties.getProperty(Consts.KAFKA_TOPIC_NAME);
        final String app = "app1";
        final String offset = ApplicationProperties.getProperty(Consts.KAFKA_CONSUMER_OFFSET, "largest");

        logger.debug("Data processing configuration. brokers={}, topic={}, app={}, offset={}", brokers, topic, app,
                offset);
        if (StringUtils.isBlank(brokers) || StringUtils.isBlank(topic) || StringUtils.isBlank(app)) {
            System.err.println("Usage: DataProcessor <brokers> <topic>\n" + Consts.KAFKA_BROKERS_NAME
                    + " is a list of one or more Kafka brokers separated by comma\n" + Consts.KAFKA_TOPIC_NAME
                    + " is a kafka topic to consume from \n\n\n");
            System.exit(1);
        }
        final String majorVersion = "1.0";
        final String minorVersion = "3";
        final String version = majorVersion + "." + minorVersion;
        final String applicationName = "DataProcessor-" + topic + "-" + version;
        // for dev environment
         SparkConf sparkConf = new SparkConf().setMaster("local[*]").setAppName(applicationName);
        // for cluster environment
        //SparkConf sparkConf = new SparkConf().setAppName(applicationName);
        final long sparkBatchDuration = Long
                .valueOf(ApplicationProperties.getProperty(Consts.SPARK_BATCH_DURATION, "10"));

        final String sparkCheckPointDir = ApplicationProperties.getProperty(Consts.SPARK_CHECKPOINTING_DIR);

        JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(sparkBatchDuration));
        logger.debug("setting checkpoint directory={}", sparkCheckPointDir);
        jssc.checkpoint(sparkCheckPointDir);

        HashSet<String> topicsSet = new HashSet<String>(Arrays.asList(topic.split(",")));

        HashMap<String, String> kafkaParams = new HashMap<String, String>();
        kafkaParams.put("metadata.broker.list", brokers);
        kafkaParams.put("auto.offset.reset", offset);
        kafkaParams.put("group.id", "app1");

//          @formatter:off
            JavaPairInputDStream<String, String> messages = KafkaUtils.createDirectStream(
                    jssc, 
                    String.class, 
                    String.class,
                    StringDecoder.class, 
                    StringDecoder.class, 
                    kafkaParams, 
                    topicsSet
            );
//          @formatter:on
        processRDD(messages, app);
        return jssc;
    }

    private void processRDD(JavaPairInputDStream<String, String> messages, final String app) {
        JavaDStream<MsgStruct> rdd = messages.map(new MessageProcessFunction());

        rdd.foreachRDD(new Function<JavaRDD<MsgStruct>, Void>() {

            private static final long serialVersionUID = 250647626267731218L;

            @Override
            public Void call(JavaRDD<MsgStruct> currentRdd) throws Exception {
                if (!currentRdd.isEmpty()) {
                    logger.debug("Receive RDD. Create JobDispatcherFunction at HOST={}", FunctionUtil.getHostName());
                    currentRdd.foreachPartition(new VoidFunction<Iterator<MsgStruct>>() {

                        @Override
                        public void call(Iterator<MsgStruct> arg0) throws Exception {
                            while(arg0.hasNext()){
                                System.out.println(arg0.next().toString());
                            }
                        }
                    });
                } else {
                    logger.debug("Current RDD is empty.");
                }
                return null;
            }
        });
    }
    public static class MessageProcessFunction implements Function<Tuple2<String, String>, MsgStruct> {
        @Override
        public MsgStruct call(Tuple2<String, String> data) throws Exception {
            String message = data._2();
            System.out.println("message:"+message);
            return MsgStruct.parse(message);
        }

    }
    public static class MsgStruct implements Serializable{
        private String message;
        public static MsgStruct parse(String msg){
            MsgStruct m = new MsgStruct();
            m.message = msg;
            return m;
        }
        public String toString(){
            return "content inside="+message;
        }
    }

}

Using the DirectStream approach, it is a correct assumption that messages sent to a Kafka partition will land in the same Spark partition.

What we cannot assume is that each Spark partition will be processed by the same Spark worker each time. On each batch interval, a Spark task is created for the OffsetRange of each partition and sent to the cluster for processing, landing on some available worker.
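
To see this in practice, here is a small diagnostic sketch (my addition, not part of the question's job) that logs the Spark partition id together with the executor hostname on every batch; over a few batches the same partition id will typically show up on different hosts:

import java.net.InetAddress;
import java.util.Iterator;

import org.apache.spark.TaskContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.VoidFunction;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;

import scala.Tuple2;

public class PartitionHostLogger {
    // Attach to the "messages" stream from the question to print which executor
    // host processes which Spark partition in each batch interval.
    public static void attach(JavaPairInputDStream<String, String> messages) {
        messages.foreachRDD(new Function<JavaPairRDD<String, String>, Void>() {
            @Override
            public Void call(JavaPairRDD<String, String> rdd) throws Exception {
                rdd.foreachPartition(new VoidFunction<Iterator<Tuple2<String, String>>>() {
                    @Override
                    public void call(Iterator<Tuple2<String, String>> records) throws Exception {
                        String host = InetAddress.getLocalHost().getHostName();
                        int partition = TaskContext.get().partitionId();
                        System.out.println("partition=" + partition + " processed on host=" + host);
                    }
                });
                return null;
            }
        });
    }
}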

What you are looking for is partition locality. The only partition locality that the direct Kafka consumer supports is the Kafka host containing the offset range being processed, and only in the case that your Spark and Kafka deployments are colocated; but that's a deployment topology I don't see very often.
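
For completeness: if you do run executors on the Kafka brokers, the newer spark-streaming-kafka-0-10 integration (not the 0.8 direct stream used in the question) lets you request that locality explicitly with LocationStrategies.PreferBrokers(). A sketch under that assumption, where jssc, topicCollection, and kafkaParams are the same placeholder names used in the snippet further below:

// Assumes the spark-streaming-kafka-0-10 artifact (org.apache.spark.streaming.kafka010.*).
// PreferBrokers asks Spark to schedule each partition's tasks on the broker that leads
// that partition, which only pays off when executors run on the same hosts as the brokers.
JavaInputDStream<ConsumerRecord<String, String>> stream = KafkaUtils.createDirectStream(
        jssc,
        LocationStrategies.PreferBrokers(),
        ConsumerStrategies.<String, String>Subscribe(topicCollection, kafkaParams));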

In case your requirements dictate the need for host locality, you should look into Apache Samza or Kafka Streams.

According to the Spark Streaming + Kafka Integration Guide (Kafka broker version 0.10.0 or higher), you can specify an explicit mapping of partitions to hosts.

Assume you have two hosts (h1 and h2), and the Kafka topic topic-name has three partitions. The following critical code shows how to map a specific partition to a host in Java.

Map<TopicPartition, String> partitionMapToHost = new HashMap<>();
// partition 0 -> h1, partition 1 and 2 -> h2
partitionMapToHost.put(new TopicPartition("topic-name", 0), "h1");
partitionMapToHost.put(new TopicPartition("topic-name", 1), "h2");
partitionMapToHost.put(new TopicPartition("topic-name", 2), "h2");
List<String> topicCollection = Arrays.asList("topic-name");
Map<String, Object> kafkaParams = new HashMap<>();
kafkaParams.put("bootstrap.servers", "10.0.0.2:9092,10.0.0.3:9092");
kafkaParams.put("group.id", "group-id-name");
kafkaParams.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
kafkaParams.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
JavaInputDStream<ConsumerRecord<String, String>> records = KafkaUtils.createDirectStream(jssc,
    LocationStrategies.PreferFixed(partitionMapToHost), // PreferFixed is the key
    ConsumerStrategies.<String, String>Subscribe(topicCollection, kafkaParams));

You can also use LocationStrategies.PreferConsistent(), which distributes partitions evenly across the available executors; a given partition is then consistently consumed by the same executor as long as the set of executors does not change.
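
A minimal variant of the snippet above, swapping PreferFixed for PreferConsistent (same placeholder names and assumptions):

// Same setup as above, but letting Spark spread the partitions evenly over whatever
// executors are available; the partition-to-executor assignment stays stable only
// for as long as the set of executors does not change.
JavaInputDStream<ConsumerRecord<String, String>> records = KafkaUtils.createDirectStream(jssc,
    LocationStrategies.PreferConsistent(),
    ConsumerStrategies.<String, String>Subscribe(topicCollection, kafkaParams));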
