
Using Spark StreamingContext to Consume from Kafka topic

I am brand new to Spark & Kafka and am trying to get some Scala code (running as a Spark job) to act as a long-running process (not just a short-lived/scheduled task) and to continuously poll a Kafka broker for messages. When it receives messages, I just want them printed out to the console/STDOUT. Again, this needs to be a long-running process and basically (try to) live forever.

After doing some digging, it seems like a StreamingContext is what I want to use. Here's my best attempt:

import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.storage._
import org.apache.spark.streaming.{StreamingContext, Seconds, Minutes, Time}
import org.apache.spark.streaming.dstream._
import org.apache.spark.streaming.kafka._
import kafka.serializer.StringDecoder

def createKafkaStream(ssc: StreamingContext, kafkaTopics: String, brokers: String): DStream[(String, String)] = {
    val topicsSet = kafkaTopics.split(",").toSet
    val props = Map(
        "bootstrap.servers" -> "my-kafka.example.com:9092",
        "metadata.broker.list" -> "my-kafka.example.com:9092",
        "serializer.class" -> "kafka.serializer.StringEncoder",
        "value.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
        "value.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer",
        "key.serializer" -> "org.apache.kafka.common.serialization.StringSerializer",
        "key.deserializer" -> "org.apache.kafka.common.serialization.StringDeserializer"
    )
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](ssc, props, topicsSet)
}

def processEngine(): StreamingContext = {
    val ssc = new StreamingContext(sc, Seconds(1))

    val topicStream = createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").print()

    ssc
}

StreamingContext.getActive.foreach {
    _.stop(stopSparkContext = false)
}

val ssc1 = StreamingContext.getActiveOrCreate(processEngine)
ssc1.start()
ssc1.awaitTermination()

When I run this, I get no exceptions/errors, but nothing seems to happen. I can confirm there are messages on the topic. Any ideas as to where I'm going awry?

When you use foreachRDD, the output is printed on the Worker nodes, not on the Master. I'm assuming you're looking at the Master's console output. You can use DStream.print instead:

val ssc = new StreamingContext(sc, Seconds(1))
val topicStream = createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092").print()

Also, don't forget to call ssc.awaitTermination() after ssc.start():

ssc.start()
ssc.awaitTermination()

As a sidenote, I'm assuming you copy-pasted this example, but there's no need to use transform on the DStream if you're not actually planning to do anything with the OffsetRange.
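For reference, here is a minimal sketch of the pattern that example is built around, i.e. using transform to grab the per-batch OffsetRange values (this reuses the createKafkaStream helper above and assumes the spark-streaming-kafka 0.8 direct-stream API; the variable names are illustrative):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

var offsetRanges = Array.empty[OffsetRange]

val stream = createKafkaStream(ssc, "mytopic", "my-kafka.example.com:9092")
    .transform { rdd =>
        // RDDs from the direct stream expose their Kafka offsets via HasOffsetRanges
        offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        rdd
    }

stream.foreachRDD { rdd =>
    offsetRanges.foreach { o =>
        println(s"${o.topic} partition ${o.partition}: offsets ${o.fromOffset} to ${o.untilOffset}")
    }
}

If all you want is to print the messages, DStream.print on the stream returned by createKafkaStream is enough and the transform step can be dropped.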

Is this your complete code? Where did you create sc? You have to create the SparkContext before the StreamingContext. You can create sc like this:

val conf = new SparkConf().setAppName("SparkConsumer")
val sc = new SparkContext(conf)

Also, without awaitTermination, it is very hard to catch and print exceptions that occur during the background data processing. Can you add ssc1.awaitTermination() at the end and see if you get any error?
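For context, awaitTermination blocks the driver and re-throws any exception raised during the background processing, so a sketch like the following (the try/catch here is illustrative, not part of the original answer) is where such errors would surface:

ssc1.start()
try {
    // Blocks until stop() is called; exceptions from the streaming job
    // are re-thrown in this thread
    ssc1.awaitTermination()
} catch {
    case e: Exception =>
        println(s"Streaming job failed: ${e.getMessage}")
        throw e
}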
