
Spark Streaming From Kafka and Write to HDFS in Avro Format

I basically want to consume data from Kafka and write it to HDFS. But what happens is that it is not writing any files to HDFS; it only creates empty files.

Please also guide me on how to modify the code if I want to write to HDFS in Avro format.

For the sake of simplicity, I am writing to the local C drive.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Milliseconds, StreamingContext}
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

object KafkaStreaming extends App {
val conf = new SparkConf().setMaster("local[*]").setAppName("kafka-streaming")
val context = new SparkContext(conf)
val ssc = new StreamingContext(context, Milliseconds(1))
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id" -> "group",
  "auto.offset.reset" -> "latest",
  "enable.auto.commit" -> (true: java.lang.Boolean))
val topics = Array("topic")
val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  PreferConsistent,
  Subscribe[String, String](topics, kafkaParams))
val lines = stream.map(_.value)
stream.foreachRDD(rdd => {
  rdd.coalesce(1).saveAsTextFile("C:/data/spark/")
})
ssc.start()
ssc.awaitTermination()}

And below is the build.sbt

name := "spark-streaming"
version := "1.0"
scalaVersion := "2.11.8" 
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "2.2.0"
libraryDependencies += "org.apache.spark" % "spark-streaming_2.11" % "2.2.0"
libraryDependencies += "org.apache.spark" % "spark-streaming-kafka-0-
10_2.11" % "2.2.0"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.11.0.1"

It is not writing any files to HDFS; it only creates empty files.

Please also tell me how to debug this. A related question:

Unable to see messages from Kafka Stream in Spark

Please also guide me on writing to HDFS in Avro format. I found this example:

https://github.com/sryza/simplesparkavroapp

package com.cloudera.sparkavro

import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyOutputFormat}
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SparkSpecificAvroWriter {
  def main(args: Array[String]) {
    val outPath = args(0)

    val sparkConf = new SparkConf().setAppName("Spark Avro")
    MyKryoRegistrator.register(sparkConf)   // Kryo registration helper from the linked repo
    val sc = new SparkContext(sparkConf)

    // User is an Avro-generated SpecificRecord class (see the linked repo)
    val user1 = new User("Alyssa", 256, null)
    val user2 = new User("Ben", 7, "red")

    // wrap each record as an (AvroKey, NullWritable) pair, as expected by AvroKeyOutputFormat
    val records = sc.parallelize(Array(user1, user2))
    val withValues = records.map((x) => (new AvroKey(x), NullWritable.get))

    // configure a Hadoop job with the Avro output schema and output format
    val conf = new Job()
    FileOutputFormat.setOutputPath(conf, new Path(outPath))
    val schema = User.SCHEMA$
    AvroJob.setOutputKeySchema(conf, schema)
    conf.setOutputFormatClass(classOf[AvroKeyOutputFormat[User]])

    // write the pairs out as Avro files using the new Hadoop API
    withValues.saveAsNewAPIHadoopDataset(conf.getConfiguration)
  }
}
}

Looking at your code: you can simply append the current timestamp to the path you are writing to, so that every micro-batch writes to a fresh directory instead of colliding with the previous one.

That should solve your problem. :)
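A minimal sketch of that suggestion, reusing the stream from the question (the batch-time suffix and the local base path are illustrative, not the only option):

stream.foreachRDD { (rdd, batchTime) =>
  if (!rdd.isEmpty()) {
    rdd.map(_.value)
      .coalesce(1)
      .saveAsTextFile(s"C:/data/spark/batch-${batchTime.milliseconds}")
  }
}

saveAsTextFile refuses to write into a directory that already exists, which is another reason to give every batch its own path.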

==========

If you want to append all the files into one output location, you can use DataFrames as below.

I would not recommend using append on HDFS because of the way this filesystem is designed, but here is what you can try:

  1. Create a DataFrame from your RDD.
  2. Write the DataFrame with its save mode set to append.

e.g.:

val dataframe = yourRdd.toDF()
dataframe.write.mode(SaveMode.Append).format(FILE_FORMAT).save(path)
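In the streaming job from the question, that could look roughly like the sketch below. This is a hedged sketch, not the original poster's code: the SparkSession, the column name and the output path are illustrative, and the text format is chosen only to mirror saveAsTextFile.

import org.apache.spark.sql.{SaveMode, SparkSession}

// build a SparkSession on top of the existing SparkConf from the question
val spark = SparkSession.builder().config(conf).getOrCreate()

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import spark.implicits._
    rdd.map(_.value)
      .toDF("value")                 // one string column per Kafka record
      .write
      .mode(SaveMode.Append)         // append each batch to the same location
      .text("C:/data/spark/output")  // illustrative path; use an hdfs:// URI on a cluster
  }
}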

See if that helps

Before running your Kafka consumer application, check the following points:

  • Check whether data is actually available in Kafka.

  • Change auto.offset.reset to earliest. Here earliest means Kafka automatically resets the offset to the earliest available offset.

  • Start the Kafka console producer and type some messages. Then start your Kafka consumer code, type some more messages in the console producer, and check whether those messages show up on the consumer side (a small sketch for this check follows the list).
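For that last check, a minimal sketch reusing the stream from the question, which prints incoming records to the driver console before any file output is attempted:

stream.map(_.value).print()   // prints up to 10 records of every batch to the driver console

ssc.start()
ssc.awaitTermination()

If nothing is printed while you type into the console producer, the problem is on the Kafka/consumer-configuration side rather than in the file-writing code.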

You can write the output in Avro format using the spark-avro package; after import com.databricks.spark.avro._ the DataFrameWriter gets an avro method:

dataframe.write.avro("<path>")
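A more complete, hedged sketch of that inside the streaming job from the question, mirroring the earlier text-output sketch with only the output format swapped to Avro. The spark-avro dependency line, the SparkSession, the column name and the HDFS path are assumptions for illustration, not taken from the original post:

// assumed dependency (not in the original build.sbt):
//   libraryDependencies += "com.databricks" %% "spark-avro" % "4.0.0"
import com.databricks.spark.avro._
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().config(conf).getOrCreate()

stream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    import spark.implicits._
    rdd.map(_.value)
      .toDF("value")                   // one string column per Kafka record
      .write
      .mode(SaveMode.Append)
      .avro("hdfs:///data/avro-out")   // illustrative HDFS path
  }
}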

I hope this will help you

Change this from

"auto.offset.reset" -> "latest",

to

"auto.offset.reset" -> "earliest",

so that the consumer also reads messages that were produced before it was started.
