
Spark Streaming to Kafka slow performance

I'm writing a Spark Streaming job that reads data from Kafka, makes some changes to the records, and sends the results to another Kafka cluster.

The performance of the job seems very slow: the processing rate is about 70,000 records per second. Sampling shows that 30% of the time is spent on reading and processing the data, and the remaining 70% on sending it to Kafka.

I've tried to tweak the Kafka configurations, add memory, and change batch intervals, but the only change that works is to add more cores.

Profiler: [profiler snapshot]

Spark job details:

max.cores 30
driver memory 6G
executor memory 16G
batch.interval 3 minutes
ingress rate 180,000 messages per second
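
For reference, a minimal sketch of how the settings above might be expressed in code, assuming a standalone cluster and the standard Spark property names; the application name is a placeholder, not taken from the question:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Sketch only: maps the listed job details onto standard Spark properties.
val conf = new SparkConf()
  .setAppName("kafka-to-kafka-job")   // placeholder name
  .set("spark.cores.max", "30")
  .set("spark.driver.memory", "6g")
  .set("spark.executor.memory", "16g")

val ssc = new StreamingContext(conf, Minutes(3))   // batch interval of 3 minutes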

Producer Properties (I've tried different variations)

import java.util.Properties
import org.apache.kafka.clients.producer.ProducerConfig

def buildProducerKafkaProperties: Properties = {
  val producerConfig = new Properties
  producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, destKafkaBrokers)
  producerConfig.put(ProducerConfig.ACKS_CONFIG, "all")                // wait for all in-sync replicas to ack
  producerConfig.put(ProducerConfig.BATCH_SIZE_CONFIG, "200000")       // batch size in bytes
  producerConfig.put(ProducerConfig.LINGER_MS_CONFIG, "2000")          // wait up to 2s to fill a batch
  producerConfig.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip")
  producerConfig.put(ProducerConfig.RETRIES_CONFIG, "0")
  producerConfig.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "13421728")  // total producer buffer, ~13 MB
  producerConfig.put(ProducerConfig.SEND_BUFFER_CONFIG, "13421728")    // TCP send buffer size in bytes
  producerConfig
}
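
Note that the snippet above does not set key/value serializers, which KafkaProducer requires before it can be constructed; a hedged usage sketch, assuming String keys and values and a placeholder topic name (neither is taken from the question):

import org.apache.kafka.clients.producer.{KafkaProducer, ProducerConfig, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

// Sketch only: the serializers and topic name are assumptions.
val props = buildProducerKafkaProperties
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)

val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("dest-topic", "some-value"))
producer.close()   // flushes buffered records before closing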

Sending code

stream
  .foreachRDD(rdd => {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    rdd
      .map(consumerRecord => doSomething(consumerRecord))
      .foreachPartition(partitionIter => {
        val producer = kafkaSinkBroadcast.value
        partitionIter.foreach(row => {
          producer.send(kafkaTopic, row)
          producedRecordsAcc.add(1)
        })
      })

    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  })
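
The question does not show how kafkaSinkBroadcast is built. A common pattern (an assumption here, not necessarily the author's implementation) is to broadcast a small serializable wrapper that creates the KafkaProducer lazily on each executor, so the non-serializable producer is never shipped from the driver:

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Sketch only: a lazily-initialized, broadcastable producer wrapper.
class KafkaSink(createProducer: () => KafkaProducer[String, String]) extends Serializable {
  lazy val producer: KafkaProducer[String, String] = createProducer()
  def send(topic: String, value: String): Unit =
    producer.send(new ProducerRecord[String, String](topic, value))
}

object KafkaSink {
  def apply(config: Properties): KafkaSink = {
    val createProducer = () => {
      val p = new KafkaProducer[String, String](config)
      sys.addShutdownHook { p.close() }   // flush and close when the executor JVM exits
      p
    }
    new KafkaSink(createProducer)
  }
}

// On the driver (sketch):
// val kafkaSinkBroadcast = sparkContext.broadcast(KafkaSink(buildProducerKafkaProperties))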

Versions

Spark Standalone cluster 2.3.1
Destination Kafka cluster 1.1.1
Kafka topic has 120 partitions 

Can anyone suggest how to increase the sending throughput?


Update Jul 2019

Size: 150k messages per second, each message has about 100 columns.

Main settings:

spark.cores.max = 30 # cores are balanced across all the workers
spark.streaming.backpressure.enabled = true
ob.ingest.batch.duration= 3 minutes 
 

I've tried to use rdd.repartition(30), but it made the execution slower by ~10%.

Thanks

Try to use repartition as below:

val numPartitions = ( Number of executors * Number of executor cores )
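
For illustration, one hedged way to derive this value from the running job's own state; the property lookup and its fallback are assumptions, since a standalone cluster may not set spark.executor.cores explicitly:

// Sketch only: assumes an existing SparkContext named `sc`.
val executorCount = sc.getExecutorMemoryStatus.size - 1               // subtract the driver's own entry
val coresPerExecutor = sc.getConf.getInt("spark.executor.cores", 1)   // fallback value is an assumption
val numPartitions = math.max(1, executorCount * coresPerExecutor)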

stream
  .repartition(numPartitions)
  .foreachRDD(rdd => {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    rdd
      .map(consumerRecord => doSomething(consumerRecord))
      .foreachPartition(partitionIter => {
        val producer = kafkaSinkBroadcast.value
        partitionIter.foreach(row => {
          producer.send(kafkaTopic, row)
          producedRecordsAcc.add(1)
        })
      })

    stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
  })

This will give you optimum performance.

Hope this will help.
