How to Integrate Kafka with Spark Structured Streaming with MongoDB Sink
I am trying to integrate Kafka with Spark Structured Streaming and a MongoDB sink. I need help correcting my code if I am going wrong.
I have Kafka-Spark and Spark-Mongo integrated individually. Now I am trying to build the Kafka-Spark-Mongo pipeline.
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions._ // regexp_extract, col
import spark.implicits._ // $-notation for columns
import com.mongodb.spark.sql._
import com.mongodb.spark._
import com.mongodb.spark.config._
import org.bson.Document
// Create a streaming DataFrame reading from Kafka
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "10.170.172.45:9092, 10.180.172.46:9092, 10.190.172.100:9092")
.option("subscribe", "HANZO_TEST_P2_R2, TOPIC_WITH_COMP_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
.load()
// The Kafka message value arrives in binary format and must be cast to a string
val dfs = df.selectExpr("CAST(value AS STRING)")
// The logic below extracts fields from the _raw column, which in the streaming context is "value"
val extractedDF = dfs
.withColumn("managed_server", regexp_extract($"value", "\\[(.*?)\\] \\[(.*?)\\]",2))
.withColumn("alert_summary", regexp_extract($"value", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",3))
.withColumn("oracle_details", regexp_extract($"value", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",5))
.withColumn("ecid", regexp_extract($"value", "(?<=ecid: )(.*?)(?=,)",1))
.withColumn("CompName",regexp_extract($"value",""".*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""",2))
.withColumn("composite_name", col("value").contains("composite_name"))
.withColumn("compositename", col("value").contains("compositename"))
.withColumn("composites", col("value").contains("composites"))
.withColumn("componentDN", col("value").contains("componentDN"))
// The logic below filters out any rows containing NULL values
val finalData = extractedDF.filter(
col("managed_server").isNotNull &&
col("alert_summary").isNotNull &&
col("oracle_details").isNotNull &&
col("ecid").isNotNull &&
col("CompName").isNotNull &&
col("composite_name").isNotNull &&
col("compositename").isNotNull &&
col("composites").isNotNull &&
col("componentDN").isNotNull).toDF
val toMongo = MongoSpark.save(finalData.write.option("uri", "mongodb://hanzomdbuser:hanzomdbpswd@dstk8sd.com:27018/HANZO_MDB.Testing").mode("overwrite"))
// The Kafka stream should be written out; in this case we are writing it to the console
val query = toMongo.writeStream
.outputMode("append")
.format("console")
.trigger(Trigger.ProcessingTime("20 seconds"))
.start()
query.awaitTermination()
I need to integrate these three frameworks using my code, and all the streaming results from Kafka, after being processed in Spark, need to be saved to a collection in MongoDB.
You need to create your own Mongo sink instead of the "console" sink you are using in your example. There are some available resources that can be helpful, like:
https://github.com/mongodb/mongo-spark/blob/master/examples/src/test/scala/tour/SparkStructuredStreams.scala
and
https://github.com/holdenk/spark-structured-streaming-ml/blob/master/src/main/scala/com/high-performance-spark-examples/structuredstreaming/CustomSink.scala
and
https://learningfromdata.blog/2017/04/16/real-time-data-ingestion-with-apache-spark-structured-streaming-implementation/
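For example, here is a minimal sketch of such a sink built on Spark's ForeachWriter, assuming the MongoDB Java driver is on the classpath. The MongoSink class name is made up for illustration; the URI, database/collection names, and column list just mirror the question:

import org.apache.spark.sql.{ForeachWriter, Row}
import com.mongodb.{MongoClient, MongoClientURI}
import org.bson.Document

// One MongoClient per partition; one insert per processed row
class MongoSink(uri: String, database: String, collection: String) extends ForeachWriter[Row] {
  private var client: MongoClient = _

  // Called once per partition per trigger; open the connection here
  override def open(partitionId: Long, version: Long): Boolean = {
    client = new MongoClient(new MongoClientURI(uri))
    true
  }

  // Convert each row to a BSON document and insert it
  override def process(row: Row): Unit = {
    val doc = new Document()
      .append("managed_server", row.getAs[String]("managed_server"))
      .append("alert_summary", row.getAs[String]("alert_summary"))
      .append("oracle_details", row.getAs[String]("oracle_details"))
      .append("ecid", row.getAs[String]("ecid"))
      .append("CompName", row.getAs[String]("CompName"))
    client.getDatabase(database).getCollection(collection).insertOne(doc)
  }

  // Close the connection when the partition finishes
  override def close(errorOrNull: Throwable): Unit = {
    if (client != null) client.close()
  }
}

// Replace format("console") with the custom sink; everything else stays the same
val query = finalData.writeStream
  .outputMode("append")
  .foreach(new MongoSink("mongodb://hanzomdbuser:hanzomdbpswd@dstk8sd.com:27018", "HANZO_MDB", "Testing"))
  .trigger(Trigger.ProcessingTime("20 seconds"))
  .start()

query.awaitTermination()

Note that the MongoSpark.save(...) line in your code cannot work on a streaming DataFrame (there is no .write on a stream); with this approach the ForeachWriter performs the per-row writes instead.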