Spark Structured Streaming from Kafka to save data in Cassandra in a distributed fashion
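I am trying to set up Structured Streaming from Kafka to Spark, where each message is a JSON string. I now want to parse the JSON into specific columns and then save the DataFrame to a Cassandra table as fast as possible. I am using Spark 2.4 and Cassandra 2.11 (Apache), not DSE.
I first tried a direct stream, which gives a DStream of case class instances, and used foreachRDD on the DStream to save to Cassandra, but it hung every 6-7 days. So I am now trying Structured Streaming, which provides a DataFrame directly that can be saved to Cassandra.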
val conf = new SparkConf()
  .setMaster("local[3]")
  .setAppName("Fleet Live Data")
  .set("spark.cassandra.connection.host", "ip")
  .set("spark.cassandra.connection.keep_alive_ms", "20000")
  .set("spark.cassandra.auth.username", "user")
  .set("spark.cassandra.auth.password", "pass")
  .set("spark.streaming.stopGracefullyOnShutdown", "true")
  .set("spark.executor.memory", "2g")
  .set("spark.driver.memory", "2g")
  .set("spark.submit.deployMode", "cluster")
  .set("spark.executor.instances", "4")
  .set("spark.executor.cores", "2")
  .set("spark.cores.max", "9")
  .set("spark.driver.cores", "9")
  .set("spark.speculation", "true")
  .set("spark.locality.wait", "2s")
val spark = SparkSession
  .builder
  .appName("Fleet Live Data")
  .config(conf)
  .getOrCreate()
println("Spark Session Config Done")
val sc = SparkContext.getOrCreate(conf)
sc.setLogLevel("ERROR")
val ssc = new StreamingContext(sc, Seconds(10))
val sqlContext = new SQLContext(sc)
val topics = Map("livefleet" -> 1)
import spark.implicits._
implicit val formats = DefaultFormats
val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "brokerIP:port")
  .option("subscribe", "livefleet")
  .load()
val collection = df.selectExpr("CAST(value AS STRING)").map(f => parse(f.toString()).extract[liveevent])
val query = collection.writeStream
  .option("checkpointLocation", "/tmp/check_point/")
  .format("kafka")
  .format("org.apache.spark.sql.cassandra")
  .option("keyspace", "trackfleet_db")
  .option("table", "locationinfotemp1")
  .outputMode(OutputMode.Update)
  .start()
query.awaitTermination()
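I expected the DataFrame to be saved to Cassandra, but I get this error:
Exception in thread "main" org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start()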
Based on the error message, I'd say Cassandra is not a streaming sink, and I believe you need to use .write:
collection.write
  .format("org.apache.spark.sql.cassandra")
  .options(...)
  .save()
Or
import org.apache.spark.sql.cassandra._
// ...
collection.cassandraFormat(table, keyspace).save()
Docs: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/14_data_frames.md#example-using-helper-commands-to-write-datasets
But that may only work for batch DataFrames. For streaming sources, see this example, which uses .saveToCassandra:
import com.datastax.spark.connector.streaming._
// ...
val wc = stream.flatMap(_.split("\\s+"))
  .map(x => (x, 1))
  .reduceByKey(_ + _)
  .saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
ssc.start()
If that doesn't work, then you will need a ForeachWriter:
collection.writeStream
  .foreach(new ForeachWriter[Row] {
    override def process(row: Row): Unit = {
      println(s"Processing ${row}")
    }
    override def close(errorOrNull: Throwable): Unit = {}
    override def open(partitionId: Long, version: Long): Boolean = {
      true
    }
  })
  .start()
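The skeleton above only prints each row. As a rough sketch of what a Cassandra-writing version might look like, the class below opens a session through the connector's CassandraConnector and issues one INSERT per row. It assumes the 2.4.x line of spark-cassandra-connector (where the session is the 3.x Java driver Session), and the column names vehicle_id, lat and lon are hypothetical, since the real table layout is not shown in the question.

```scala
import com.datastax.spark.connector.cql.CassandraConnector
import org.apache.spark.SparkConf
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical writer: the column names below are placeholders, not the asker's schema.
class CassandraRowWriter(conf: SparkConf) extends ForeachWriter[Row] {
  // CassandraConnector is serializable, so it can be shipped to executors with the writer.
  private val connector = CassandraConnector(conf)
  private val insertCql =
    "INSERT INTO trackfleet_db.locationinfotemp1 (vehicle_id, lat, lon) VALUES (?, ?, ?)"

  // Accept every partition/epoch; no per-partition setup is needed in this sketch.
  override def open(partitionId: Long, version: Long): Boolean = true

  // One unprepared INSERT per row: simple, but not tuned for throughput.
  override def process(row: Row): Unit =
    connector.withSessionDo { session =>
      session.execute(
        insertCql,
        row.getAs[String]("vehicle_id"),
        Double.box(row.getAs[Double]("lat")),
        Double.box(row.getAs[Double]("lon")))
    }

  override def close(errorOrNull: Throwable): Unit = ()
}
```

It would be attached with something like parsedDf.writeStream.foreach(new CassandraRowWriter(spark.sparkContext.getConf)).start(), where parsedDf is a streaming DataFrame that has those columns. Row-at-a-time inserts are likely to be slower than a batch write, so this is mainly useful when per-row control is needed.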
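It is also worth mentioning that DataStax has released a Kafka connector, and Kafka Connect ships with your Kafka installation (assuming 0.10.2 or later). DataStax has published an announcement for it.
If you are using Spark 2.4.0, try the foreachBatch writer. It lets you use the batch-based writers inside a streaming query.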
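import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val query = test.writeStream
  // Each micro-batch arrives as a plain DataFrame, so the batch Cassandra writer can be used.
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.write
      .format("org.apache.spark.sql.cassandra")
      .mode(saveMode)
      .options(Map("keyspace" -> keySpace, "table" -> tableName))
      .save()
  }
  .trigger(Trigger.ProcessingTime(3000))
  .option("checkpointLocation", "/checkpointing")
  .start()
query.awaitTermination()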
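To tie this back to the question, here is a minimal end-to-end sketch that parses the Kafka value into columns with from_json and writes each micro-batch to Cassandra through foreachBatch. The schema fields (vehicle_id, lat, lon) and the object name FleetToCassandra are assumptions for illustration, because the liveevent case class is not shown. The broker address, topic, keyspace, table and checkpoint path are copied from the question.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{DoubleType, StringType, StructType}

object FleetToCassandra {
  // Hypothetical event layout; replace with the real fields of the JSON payload.
  val eventSchema: StructType = new StructType()
    .add("vehicle_id", StringType)
    .add("lat", DoubleType)
    .add("lon", DoubleType)

  // Casts the Kafka `value` to a string and expands the parsed JSON into typed columns.
  def parseEvents(kafkaDf: DataFrame): DataFrame =
    kafkaDf
      .select(from_json(col("value").cast("string"), eventSchema).as("event"))
      .select("event.*")

  def main(args: Array[String]): Unit = {
    // Cassandra connection settings are expected to come from spark-submit or SparkConf.
    val spark = SparkSession.builder.appName("Fleet Live Data").getOrCreate()

    val kafkaDf = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "brokerIP:port")
      .option("subscribe", "livefleet")
      .load()

    // A typed function value selects the Scala overload of foreachBatch unambiguously.
    val writeBatch: (DataFrame, Long) => Unit = (batchDF, _) =>
      batchDF.write
        .format("org.apache.spark.sql.cassandra")
        .options(Map("keyspace" -> "trackfleet_db", "table" -> "locationinfotemp1"))
        .mode(SaveMode.Append)
        .save()

    parseEvents(kafkaDf).writeStream
      .foreachBatch(writeBatch)
      .option("checkpointLocation", "/tmp/check_point/")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()
      .awaitTermination()
  }
}
```

Keeping the parsing in a separate function that takes a DataFrame means it can be tried on a small static DataFrame first, without Kafka or Cassandra running. Using from_json with an explicit schema also keeps the parsing inside Spark SQL rather than in a per-record map over JSON strings.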