How to Integrate Kafka with Spark Structured Streaming with MongoDB Sink
I am trying to integrate Kafka with Spark Structured Streaming and a MongoDB sink. I need help correcting my code if I am going wrong.
I have Kafka-Spark and Spark-Mongo integrated individually. Now I am trying to build the Kafka-Spark-Mongo pipeline.
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.functions._ // regexp_extract, col
import spark.implicits._ // $-notation for columns
import com.mongodb.spark.sql._
import com.mongodb.spark._
import com.mongodb.spark.config._
import org.bson.Document
// Create a streaming DataFrame reading from Kafka
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "10.170.172.45:9092, 10.180.172.46:9092, 10.190.172.100:9092")
.option("subscribe", "HANZO_TEST_P2_R2, TOPIC_WITH_COMP_P2_R2, TOPIC_WITH_COMP_P2_R2.DIT, TOPIC_WITHOUT_COMP_P2_R2.DIT")
.load()
// The Kafka message value arrives in binary format and must be cast to a string
val dfs = df.selectExpr("CAST(value AS STRING)")
// The logic below extracts fields from the _raw column, which in the streaming context is "value"
val extractedDF = dfs
.withColumn("managed_server", regexp_extract($"value", "\\[(.*?)\\] \\[(.*?)\\]",2))
.withColumn("alert_summary", regexp_extract($"value", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",3))
.withColumn("oracle_details", regexp_extract($"value", "\\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\] \\[(.*?)\\]",5))
.withColumn("ecid", regexp_extract($"value", "(?<=ecid: )(.*?)(?=,)",1))
.withColumn("CompName",regexp_extract($"value",""".*(composite_name|compositename|composites|componentDN):\s+([a-zA-Z]+)""",2))
.withColumn("composite_name", col("value").contains("composite_name"))
.withColumn("compositename", col("value").contains("compositename"))
.withColumn("composites", col("value").contains("composites"))
.withColumn("componentDN", col("value").contains("componentDN"))
// The logic below filters out any rows containing NULL values
val finalData = extractedDF.filter(
col("managed_server").isNotNull &&
col("alert_summary").isNotNull &&
col("oracle_details").isNotNull &&
col("ecid").isNotNull &&
col("CompName").isNotNull &&
col("composite_name").isNotNull &&
col("compositename").isNotNull &&
col("composites").isNotNull &&
col("componentDN").isNotNull).toDF
val toMongo = MongoSpark.save(finalData.write.option("uri", "mongodb://hanzomdbuser:hanzomdbpswd@dstk8sd.com:27018/HANZO_MDB.Testing").mode("overwrite"))
// The Kafka stream should be written out; in this case we are writing it to the console
val query = toMongo.writeStream
.outputMode("append")
.format("console")
.trigger(Trigger.ProcessingTime("20 seconds"))
.start()
query.awaitTermination()
I need to integrate these three frameworks using my code, and all the streaming results from Kafka, after being processed in Spark, need to be saved to a collection in MongoDB.
You need to create your own Mongo sink instead of the "console" sink you are using in your example. There are some available resources that can be helpful, like:
https://github.com/mongodb/mongo-spark/blob/master/examples/src/test/scala/tour/SparkStructuredStreams.scala
and
https://github.com/holdenk/spark-structured-streaming-ml/blob/master/src/main/scala/com/high-performance-spark-examples/structuredstreaming/CustomSink.scala
and
https://learningfromdata.blog/2017/04/16/real-time-data-ingestion-with-apache-spark-structured-streaming-implementation/
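For example, here is a minimal sketch of such a sink built on Spark's ForeachWriter, assuming the MongoDB Java driver is on the classpath. The MongoSink class name is made up for illustration; the URI, database/collection names, and column list just mirror the question:

import org.apache.spark.sql.{ForeachWriter, Row}
import com.mongodb.{MongoClient, MongoClientURI}
import org.bson.Document

// One MongoClient per partition; one insert per processed row
class MongoSink(uri: String, database: String, collection: String) extends ForeachWriter[Row] {
  private var client: MongoClient = _

  // Called once per partition per trigger; open the connection here
  override def open(partitionId: Long, version: Long): Boolean = {
    client = new MongoClient(new MongoClientURI(uri))
    true
  }

  // Convert each row to a BSON document and insert it
  override def process(row: Row): Unit = {
    val doc = new Document()
      .append("managed_server", row.getAs[String]("managed_server"))
      .append("alert_summary", row.getAs[String]("alert_summary"))
      .append("oracle_details", row.getAs[String]("oracle_details"))
      .append("ecid", row.getAs[String]("ecid"))
      .append("CompName", row.getAs[String]("CompName"))
    client.getDatabase(database).getCollection(collection).insertOne(doc)
  }

  // Close the connection when the partition finishes
  override def close(errorOrNull: Throwable): Unit = {
    if (client != null) client.close()
  }
}

// Replace format("console") with the custom sink; everything else stays the same
val query = finalData.writeStream
  .outputMode("append")
  .foreach(new MongoSink("mongodb://hanzomdbuser:hanzomdbpswd@dstk8sd.com:27018", "HANZO_MDB", "Testing"))
  .trigger(Trigger.ProcessingTime("20 seconds"))
  .start()

query.awaitTermination()

Note that the MongoSpark.save(...) line in your code cannot work on a streaming DataFrame (there is no .write on a stream); with this approach the ForeachWriter performs the per-row writes instead.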