
Kafka + Spark Streaming: multi-topic processing in a single job

There are 40 topics in Kafka, and a Spark Streaming job has been written to process 5 tables each. The only objective of the Spark Streaming job is to read 5 Kafka topics and write them into the corresponding 5 HDFS paths. Most of the time it works fine, but sometimes it writes topic 1 data to another HDFS path.

Kafka ==> Spark Streaming ==> HDFS

Below is the code I tried in order to have one Spark Streaming job process 5 topics and write each into its corresponding HDFS path, but it writes topic 1 data to HDFS path 5 instead of HDFS path 1.

Please provide your suggestions:

import java.text.SimpleDateFormat
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{ SparkConf, TaskContext }
import org.apache.spark.streaming.{ Seconds, StreamingContext }
import org.apache.spark.streaming.kafka010._

object SparkKafkaMultiConsumer {

  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println(s"""
        |Usage: KafkaStreams auto.offset.reset latest/earliest table1,table2,etc 
        |
        """.stripMargin)
      System.exit(1)
    }

    val date_today = new SimpleDateFormat("yyyy_MM_dd")
    val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH")
    val PATH_SEPERATOR = "/"

    import com.typesafe.config.ConfigFactory

    val conf = ConfigFactory.load("env.conf")
    val topicconf = ConfigFactory.load("topics.conf")


    // Create streaming context with the batch interval (in seconds) taken from configuration
    val sparkConf = new SparkConf().setAppName("pt_streams")
    val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))
    val kafka_topics = "kafka.topics"

    // Create direct Kafka stream with brokers and topics
    var topicsSet = topicconf.getString(kafka_topics).split(",").toSet
    if (args.length == 2) {
      print("This stream job will process table(s) : " + args(1))
      topicsSet = args(1).split(",").toSet
    }


    val topicList = topicsSet.toList

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> conf.getString("kafka.brokers"),
      "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
      "group.id" -> conf.getString("kafka.consumergroups"),
      "auto.offset.reset" -> args(0),
      "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "security.protocol" -> "SASL_PLAINTEXT")



    // Create a single direct Kafka stream subscribed to all requested topics
    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

    for (i <- 0 until topicList.length) {
      // Filter the shared stream down to one topic and write its records
      // to the HDFS path derived from that topic name
      val topicStream = messages.filter(_.topic().equals(topicList(i)))

      val data = topicStream.map(_.value())
      data.foreachRDD((rdd, batchTime) => {
        if (!rdd.isEmpty()) {
          rdd.coalesce(1).saveAsTextFile(conf.getString("hdfs.streamoutpath") + PATH_SEPERATOR + topicList(i) + PATH_SEPERATOR + date_today.format(System.currentTimeMillis())
            + PATH_SEPERATOR + date_today_hour.format(System.currentTimeMillis()) + PATH_SEPERATOR + System.currentTimeMillis())
        }
      })
    }



    try {
      // After all successful processing, commit the offsets back to Kafka
      messages.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    } catch {
      case e: Exception =>
        e.printStackTrace()
        print("error while committing the offsets")
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()

  }

}

You're better off using the HDFS connector for Kafka Connect. It is open source and available standalone or as part of Confluent Platform. A simple configuration file is all it takes to stream from Kafka topics to HDFS, and as a bonus it will create the Hive table for you if you have a schema for your data.
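
For illustration only, here is a minimal sketch of what such a sink configuration might look like, assuming the Confluent kafka-connect-hdfs connector; the topic names, HDFS URL, and directory below are hypothetical placeholders, not values from the question:

# Hypothetical HDFS sink connector properties -- adjust topics, URLs, and sizes to your environment
name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# Each topic is written under its own subdirectory of topics.dir
topics=topic1,topic2,topic3,topic4,topic5
hdfs.url=hdfs://namenode:8020
topics.dir=/data/streams
flush.size=1000

Run this with a standalone or distributed Connect worker and the connector maintains the topic-to-directory mapping for you, so there is no hand-written filter logic to get wrong.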

You're re-inventing the wheel if you try to code this yourself; it's a solved problem :)
