[英]spark streaming with kafka one consumer is reading the data
I am using the spark steaming using the kafka i have a topic with 20 partitions. 我正在使用kafka进行火花蒸煮,我有一个包含20个分区的主题。 When streaming job runs only one consumer is reading the data from all the topics which leads to slow in reading the data. 当流作业运行时,只有一个使用者从所有主题中读取数据,这导致读取数据的速度变慢。 Is there a way can we configure one consumer per partion in spark steaming. 有没有办法在火花蒸汽中为每个分区配置一个消费者。
JavaStreamingContext jsc = AnalyticsContext.getInstance().getSparkStreamContext();
Map<String, Object> kafkaParams = MessageSessionFactory.getConsumerConfigParamsMap(MessageSessionFactory.DEFAULT_CLUSTER_IDENTITY, consumerGroup);
String[] topics = topic.split(",");
Collection<String> topicCollection = Arrays.asList(topics);
metricStream = KafkaUtils.createDirectStream(
jsc,
LocationStrategies.PreferConsistent(),
ConsumerStrategies.Subscribe(topicCollection, kafkaParams)
);
}
TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
metric_data_spark 16 3379403197 3379436869 33672 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 7 3399030625 3399065857 35232 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 13 3389008901 3389044210 35309 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 17 3380638947 3380639928 981 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 1 3593201424 3593236844 35420 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 8 3394218406 3394252084 33678 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 19 3376897309 3376917998 20689 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 3 3447204634 3447240071 35437 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 18 3375082623 3375083663 1040 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 2 3433294129 3433327970 33841 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 9 3396324976 3396345705 20729 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 0 3582591157 3582624892 33735 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 14 3381779702 3381813477 33775 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 4 3412492002 3412525779 33777 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 11 3393158700 3393179419 20719 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 10 3392216079 3392235071 18992 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 15 3383001380 3383036803 35423 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 6 3398338540 3398372367 33827 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 12 3387738477 3387772279 33802 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
metric_data_spark 5 3408698217 3408733614 35397 consumer-2-da278f31-c368-414c-925b-d3ca4881709e /xx.xx.xx.xx consumer-2
What changes we need to do make one consumer/per partition to read the data. 我们需要做的更改使每个使用者/每个分区读取数据。
Since you are using the consistent placement strategy, it should distribute over executors 由于您使用的是一致的展示位置策略,因此应将其分配给执行者
When you run a Spark submit, you need to specify that you want at most 20 executors to be started. 运行Spark提交时,需要指定最多要启动20个执行程序。 --num-executors 20
If you do more than that, though, you'll have idle executors not consuming Kafka data (but they might still be able to process other stages) 但是,如果您做得更多,那么您将有闲置的执行程序不使用Kafka数据(但他们仍然可以处理其他阶段)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.