There are 40 topics in Kafka, and Spark Streaming jobs have been written to process 5 topics each. The only objective of each Spark Streaming job is to read its 5 Kafka topics and write them to the corresponding 5 HDFS paths. Most of the time this works fine, but sometimes it writes topic 1's data to another topic's HDFS path.
Below is the code I tried in order to have one Spark Streaming job process 5 topics and write each to its corresponding HDFS path, but it is writing topic 1 data to HDFS path 5 instead of HDFS path 1.
Please provide your suggestions:
import java.text.SimpleDateFormat

import com.typesafe.config.ConfigFactory
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._

object SparkKafkaMultiConsumer {

  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println(
        s"""
           |Usage: KafkaStreams auto.offset.reset latest/earliest table1,table2,etc
           |
         """.stripMargin)
      System.exit(1)
    }

    val date_today = new SimpleDateFormat("yyyy_MM_dd")
    val date_today_hour = new SimpleDateFormat("yyyy_MM_dd_HH")
    val PATH_SEPARATOR = "/"

    val conf = ConfigFactory.load("env.conf")
    val topicconf = ConfigFactory.load("topics.conf")

    // Create context with a custom batch interval (in seconds)
    val sparkConf = new SparkConf().setAppName("pt_streams")
    val ssc = new StreamingContext(sparkConf, Seconds(conf.getString("kafka.duration").toLong))

    // Topics come from topics.conf by default, or from the second CLI argument
    var topicsSet = topicconf.getString("kafka.topics").split(",").toSet
    if (args.length == 2) {
      print("This stream job will process table(s) : " + args(1))
      topicsSet = args(1).split(",").toSet
    }
    val topicList = topicsSet.toList

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> conf.getString("kafka.brokers"),
      "zookeeper.connect" -> conf.getString("kafka.zookeeper"),
      "group.id" -> conf.getString("kafka.consumergroups"),
      "auto.offset.reset" -> args(0),
      "enable.auto.commit" -> (conf.getString("kafka.autoCommit").toBoolean: java.lang.Boolean),
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "security.protocol" -> "SASL_PLAINTEXT")

    // Create a single direct Kafka stream subscribed to all topics
    val messages = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](topicsSet, kafkaParams))

    // For each topic, filter its records out of the shared stream and write them to
    // <hdfs.streamoutpath>/<topic>/<yyyy_MM_dd>/<yyyy_MM_dd_HH>/<timestamp>
    for (i <- 0 until topicList.length) {
      val topicStream = messages.filter(_.topic() == topicList(i))
      val data = topicStream.map(_.value())
      data.foreachRDD { (rdd, batchTime) =>
        if (!rdd.isEmpty()) {
          rdd.coalesce(1).saveAsTextFile(
            conf.getString("hdfs.streamoutpath") + PATH_SEPARATOR + topicList(i)
              + PATH_SEPARATOR + date_today.format(System.currentTimeMillis())
              + PATH_SEPARATOR + date_today_hour.format(System.currentTimeMillis())
              + PATH_SEPARATOR + System.currentTimeMillis())
        }
      }
    }

    try {
      // After all successful processing, commit the offsets back to Kafka
      messages.foreachRDD { rdd =>
        val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
        messages.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      }
    } catch {
      case e: Exception =>
        e.printStackTrace()
        print("error while committing the offsets")
    }

    // Start the computation
    ssc.start()
    ssc.awaitTermination()
  }
}
You're better off using the HDFS connector for Kafka Connect. It is open source and available standalone or as part of Confluent Platform. A simple configuration file is enough to stream from Kafka topics to HDFS, and as a bonus it will create the Hive table for you if you have a schema for your data; a sketch of such a configuration is shown below.
You're re-inventing the wheel if you try to code this yourself; it's a solved problem :)
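As a rough illustration only, a standalone HDFS sink connector config might look like the sketch below. The connector class and core property names come from the Confluent HDFS connector; the topic names, HDFS URL, and Hive metastore address are placeholder assumptions to replace with your own values.

name=hdfs-sink
connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
tasks.max=1
# the 5 topics this job should land in HDFS (placeholder names)
topics=topic1,topic2,topic3,topic4,topic5
# target cluster and number of records per committed file (placeholder values)
hdfs.url=hdfs://namenode:8020
flush.size=1000
# optional Hive integration: the connector creates/updates the Hive tables for you
hive.integration=true
hive.metastore.uris=thrift://hive-metastore:9083
schema.compatibility=BACKWARD

The connector writes each topic under its own subdirectory of the output path, so the per-topic routing you are hand-coding in Spark is handled for you.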