
Spark Structured Streaming from Kafka to save data in Cassandra in a distributed fashion

I'm trying to create a Structured Streaming job from Kafka into Spark, where each Kafka value is a JSON string. I want to parse the JSON into specific columns and then save the dataframe to a Cassandra table at optimum speed. I am using Spark 2.4 and Cassandra 2.11 (Apache, not DSE).

I have tried creating a direct stream, which gives a DStream of a case class that I saved into Cassandra using foreachRDD on the DStream, but that job hangs after every 6-7 days. So I am now trying Structured Streaming, which gives a dataframe directly that can be saved to Cassandra.
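For reference, the earlier DStream-based job looked roughly like this (a minimal sketch: the liveevent fields, the Kafka settings, and the group id are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.kafka.common.serialization.StringDeserializer
import com.datastax.spark.connector._
import org.json4s._
import org.json4s.jackson.JsonMethods._

// Hypothetical case class; the real fields depend on the JSON payload
case class liveevent(deviceId: String, lat: Double, lon: Double)

val conf = new SparkConf()
  .setAppName("Fleet Live Data")
  .set("spark.cassandra.connection.host", "ip")
val ssc = new StreamingContext(conf, Seconds(10))

implicit val formats: Formats = DefaultFormats

val kafkaParams = Map[String, Object](
  "bootstrap.servers"  -> "brokerIP:port",
  "key.deserializer"   -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[StringDeserializer],
  "group.id"           -> "fleet-consumer")

val stream = KafkaUtils.createDirectStream[String, String](
  ssc,
  LocationStrategies.PreferConsistent,
  ConsumerStrategies.Subscribe[String, String](Seq("livefleet"), kafkaParams))

// Parse each JSON record into the case class and save every micro-batch RDD
stream.map(record => parse(record.value()).extract[liveevent])
  .foreachRDD(rdd => rdd.saveToCassandra("trackfleet_db", "locationinfotemp1"))

ssc.start()
ssc.awaitTermination()

My new Structured Streaming code is below: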

val conf = new SparkConf()
  .setMaster("local[3]")
  .setAppName("Fleet Live Data")
  .set("spark.cassandra.connection.host", "ip")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.cassandra.auth.username", "user")
  .set("spark.cassandra.auth.password", "pass")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "2g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.cores.max", "9")
  .set("spark.driver.cores", "9")
  .set("spark.speculation", "true")
  .set("spark.locality.wait", "2s")

val spark = SparkSession
  .builder
  .appName("Fleet Live Data")
  .config(conf)
  .getOrCreate()
println("Spark Session Config Done")

val sc = SparkContext.getOrCreate(conf)
sc.setLogLevel("ERROR")
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val topics = Map("livefleet" -> 1)
import spark.implicits._
implicit val formats = DefaultFormats

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerIP:port")
  .option("subscribe", "livefleet")
  .load()

val collection = df.selectExpr("CAST(value AS STRING)").map(f => parse(f.toString()).extract[liveevent])

val query = collection.writeStream
  .option("checkpointLocation", "/tmp/check_point/")
  .format("kafka")
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "trackfleet_db")
  .option("table", "locationinfotemp1")
  .outputMode(OutputMode.Update)
  .start()
query.awaitTermination()

The expected result is that the dataframe is saved to Cassandra, but instead I get this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()

Based on the error message, I would say Cassandra is not a Streaming Sink, and I believe you need to use .write

collection.write
    .format("org.apache.spark.sql.cassandra")
    .options(...)
    .save() 

or

import org.apache.spark.sql.cassandra._

// ...
collection.cassandraFormat(table, keyspace).save()
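Filled in with the keyspace and table from the question, the first form would look like this (note that this applies to a plain, non-streaming dataframe):

collection.write
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "trackfleet_db", "table" -> "locationinfotemp1"))
  .save()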

Docs: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#example-using-helper-commands-to-write-datasets


But that may only work for dataframes; for streaming sources, see this example, which uses .saveToCassandra:

import com.datastax.spark.connector.streaming._

// ...

// stream is a DStream of text lines; the classic word count, written to Cassandra
val wc = stream.flatMap(_.split("\\s+"))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))

ssc.start()

And if that doesn't work, you do need a ForeachWriter:

collection.writeStream
  .foreach(new ForeachWriter[Row] {

    // Called once per partition and epoch; return true to process the partition
    override def open(partitionId: Long, version: Long): Boolean = true

    // Called for every row; the actual write would go here
    override def process(row: Row): Unit = {
      println(s"Processing ${row}")
    }

    override def close(errorOrNull: Throwable): Unit = {}
  })
  .start()
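For example, a version of that writer that actually inserts rows into the question's table might look like the sketch below. It assumes CassandraConnector from the Spark Cassandra Connector, and the column and field names are hypothetical:

import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.sql.{ForeachWriter, Row}

// Serializable handle; sessions are opened lazily and cached on the executors
val connector = CassandraConnector(spark.sparkContext.getConf)

collection.writeStream
  .foreach(new ForeachWriter[Row] {
    override def open(partitionId: Long, version: Long): Boolean = true

    override def process(row: Row): Unit = connector.withSessionDo { session =>
      // Hypothetical columns; match these to the fields of your liveevent rows
      session.execute(
        "INSERT INTO trackfleet_db.locationinfotemp1 (deviceid, lat, lon) VALUES (?, ?, ?)",
        row.getAs[String]("deviceId"),
        Double.box(row.getAs[Double]("lat")),
        Double.box(row.getAs[Double]("lon")))
    }

    override def close(errorOrNull: Throwable): Unit = ()
  })
  .start()

Per-row inserts like this are simple but slower than the connector's bulk writers, which is one reason the foreachBatch approach below is usually preferable on Spark 2.4.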

Also worth mentioning: DataStax released a Kafka Connector, and Kafka Connect is included with your Kafka installation (assuming 0.10.2 or later). You can find its announcement here.

If you are using Spark 2.4.0, then try the foreachBatch writer, which lets you use the existing batch-based writers on a streaming query.

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.streaming.Trigger

// test is the streaming dataframe (collection in the question);
// keyspace and table are taken from the question
val query = test.writeStream
  .foreachBatch { (batchDF, batchId) =>
    batchDF.write
      .format("org.apache.spark.sql.cassandra")
      .mode(SaveMode.Append)
      .options(Map("keyspace" -> "trackfleet_db", "table" -> "locationinfotemp1"))
      .save()
  }
  .trigger(Trigger.ProcessingTime(3000))
  .option("checkpointLocation", "/checkpointing")
  .start()
query.awaitTermination()
