
Spark Streaming - Join on multiple Kafka streams is slow

I have 3 Kafka streams with 600k+ records each, and Spark Streaming takes more than 10 minutes to process a simple join between the streams.

Spark cluster config:

[Screenshot: Spark master UI]

This is how I'm reading the Kafka streams into temp views in Spark (Scala):

spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "KAFKASERVER")
  .option("subscribe", TOPIC1)
  .option("startingOffsets", "earliest")
  .option("endingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) AS json")
  // from_json takes the schema positionally in Scala ("schema=" is Python syntax)
  .select(from_json($"json", SCHEMA1).as("data"))
  // the parsed fields are nested under "data"; PK is needed for the join below
  .select($"data.PK", $"data.COL1", $"data.COL2")
  .createOrReplaceTempView("TABLE1")

I join the 3 tables using Spark SQL:

SELECT COL1, COL2 FROM TABLE1
JOIN TABLE2 ON TABLE1.PK = TABLE2.PK
JOIN TABLE3 ON TABLE2.PK = TABLE3.PK

Execution of the job:

[Screenshot: Spark job UI]

Am I missing some Spark configuration that I need to look into?

I ran into the same problem. A join between two streams needs more memory than I expected, and the problem disappeared when I increased the cores per executor.
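As a rough illustration, here is a minimal sketch of raising per-executor cores and shuffle parallelism when building the session; the values are assumptions for illustration, not from the original post:

import org.apache.spark.sql.SparkSession

// Illustrative values only: more cores per executor means more concurrent
// tasks per executor, and more shuffle partitions spread the join work out.
val spark = SparkSession.builder()
  .appName("kafka-stream-join")
  .config("spark.executor.cores", "4")
  .config("spark.executor.memory", "8g")
  .config("spark.sql.shuffle.partitions", "200")
  .getOrCreate()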

Unfortunately there was no test data, nor the expected result data, that I could play with, so I cannot give a precise answer.

@Asteroid's comment is valid: we can see that the number of tasks for each stage is 1. Normally a Kafka stream uses a receiver to consume the topic, and each receiver only creates one task. One approach is to use multiple receivers, split partitions, or increase your resources (number of cores) to increase parallelism; a sketch of the repartitioning idea follows below.
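A minimal sketch, assuming the TABLE1 view built earlier; the partition count of 24 is illustrative, not from the post:

// Spreading the rows across more partitions, keyed on the join column,
// lets the join run as many parallel tasks instead of one task per stage.
val repartitioned = spark.table("TABLE1").repartition(24, $"PK")
repartitioned.createOrReplaceTempView("TABLE1")

The Kafka batch source also accepts a "minPartitions" option (Spark 2.4+) that asks Spark to split Kafka partitions into smaller read tasks.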

If this still doesn't work, another way is to use the Kafka API createDirectStream. According to the documentation ( https://spark.apache.org/docs/1.4.0/api/java/org/apache/spark/streaming/kafka/KafkaUtils.html ), this creates an input stream that directly pulls messages from the Kafka brokers without using any receiver.

I've drafted some preliminary sample code for creating a direct stream below. You may want to study it and customize it to your own preference.

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{StringType, StructField, StructType}
import org.apache.spark.streaming.Time
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "KAFKASERVER",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "use_a_separate_group_id_for_each_stream",
  // The direct stream takes native Kafka consumer configs;
  // "auto.offset.reset" replaces the "startingOffsets"/"endingOffsets"
  // options used by the Kafka data source.
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (false: java.lang.Boolean)
)

val topics = Array(TOPIC1)
// Receiver-less direct stream: each Kafka partition maps to one Spark
// partition, so parallelism follows the topic's partition count.
val stream = KafkaUtils.createDirectStream[String, String](
  streamingContext,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams)
)

val schema = StructType(Seq(StructField("data", StringType, nullable = true)))
var df = spark.createDataFrame(spark.sparkContext.emptyRDD[Row], schema)

val dstream = stream.map(_.value())
dstream.foreachRDD { (rdd: RDD[String], time: Time) =>
  // Parse each micro-batch of JSON strings, accumulate it, and expose
  // the running result as a temp view for Spark SQL.
  val tdf = spark.read.schema(schema).json(rdd)
  df = df.union(tdf)
  df.createOrReplaceTempView("TABLE1")
}
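One caveat on this design: the union inside foreachRDD grows the DataFrame's lineage with every micro-batch, so a long-running job would typically checkpoint the accumulated result periodically, or register only the current batch as the view instead of the running union.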

Some related materials:

https://mapr.com/blog/monitoring-real-time-uber-data-using-spark-machine-learning-streaming-and-kafka-api-part-2/ (scroll down to the Kafka consumer code portion; the other sections are irrelevant)

https://spark.apache.org/docs/latest/streaming-kafka-0-10-integration.html (Spark docs for creating a direct stream)

Good luck!
