
Reading kafka topic using spark dataframe

I want to create a dataframe on top of a kafka topic and then register that dataframe as a temp table so I can perform a minus operation on the data. I have written the code below, but while querying the registered table I'm getting the error "org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"

import org.apache.spark.sql.functions._   // from_json, col
import org.apache.spark.sql.types._       // StructType, StructField, StringType

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "SERVER ******")
  .option("subscribe", "TOPIC_NAME")
  .option("startingOffsets", "earliest")
  .load()

df.printSchema()

val personStringDF = df.selectExpr("CAST(value AS STRING)")

val user_schema = StructType(Array(
  StructField("OEM", StringType, true),
  StructField("IMEI", StringType, true),
  StructField("CUSTOMER_ID", StringType, true),
  StructField("REQUEST_SOURCE", StringType, true),
  StructField("REQUESTER", StringType, true),
  StructField("REQUEST_TIMESTAMP", StringType, true),
  StructField("REASON_CODE", StringType, true)))


val personDF = personStringDF.select(from_json(col("value"), user_schema).as("data")).select("data.*")

personDF.registerTempTable("final_df1")

spark.sql("select * from final_df1").show 

ERROR: "org.apache.spark.sql.AnalysisException: Queries with streaming sources must be executed with writeStream.start();;"

I have also tried the start() method, and I'm getting the error below.

20/08/11 00:59:30 ERROR streaming.MicroBatchExecution: Query final_df1 [id = 1a3e2ea4-2ec1-42f8-a5eb-8a12ce0fb3f5, runId = 7059f3d2-21ec-43c4-b55a-8c735272bf0f] terminated with error java.lang.AbstractMethodError
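A java.lang.AbstractMethodError raised on start() usually points to a binary mismatch between the spark-sql-kafka-0-10 connector and the Spark version running on the cluster, so the two should be kept in lockstep. An illustrative build.sbt line, assuming Spark 2.4.5 (the version numbers here are placeholders; adjust them to the actual cluster versions):

// build.sbt -- keep the connector version identical to the Spark version
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.5"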

NOTE: My main objective behind writing this script is to run a minus query on this data and compare it with one of the registered tables I have on the cluster. To summarise: if I send 1000 records from an oracle database to a kafka topic, I create a dataframe on top of the oracle table and register it as a temp table, and I do the same with the kafka topic. Then I want to run a minus query between the source (oracle) and the target (kafka topic) to perform 100% data validation between them. (Is registering a kafka topic as a temporary table possible?)

Use a memory sink instead of registerTempTable. Check the code below.

import org.apache.spark.sql.functions._   // from_json, col
import org.apache.spark.sql.types._       // StructType, StructField, StringType

val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "SERVER ******")
.option("subscribe", "TOPIC_NAME")
.option("startingOffsets", "earliest")
.load()

df.printSchema()

val personStringDF = df.selectExpr("CAST(value AS STRING)")

val user_schema = StructType(Array(
  StructField("OEM", StringType, true),
  StructField("IMEI", StringType, true),
  StructField("CUSTOMER_ID", StringType, true),
  StructField("REQUEST_SOURCE", StringType, true),
  StructField("REQUESTER", StringType, true),
  StructField("REQUEST_TIMESTAMP", StringType, true),
  StructField("REASON_CODE", StringType, true)))


val personDF = personStringDF.select(from_json(col("value"), user_schema).as("data")).select("data.*")


personDF
  .writeStream
  .outputMode("append")
  .format("memory")
  .queryName("final_df1")
  .start()

spark.sql("select * from final_df1").show(10,false)

A streaming DataFrame doesn't support the show() method directly. When you call the start() method, it starts a background thread that streams the input data to the sink; since you are using the memory sink here, the results land in an in-memory table that you can query with spark.sql. You don't need to call show() on the streaming DataFrame itself.
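Because the memory-sink table is only filled as micro-batches complete, an immediate query can come back empty. A minimal sketch for a one-shot check, assuming the final_df1 query started above is still active (spark.streams.active and processAllAvailable are standard StreamingQuery APIs):

// Look up the running query by its name.
val query = spark.streams.active.find(_.name == "final_df1").get

// Block until everything currently available in the topic has been
// processed into the in-memory table, then query it.
query.processAllAvailable()
spark.sql("select * from final_df1").show(10, false)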

Remove the lines below,

personDF.registerTempTable("final_df1")
spark.sql("select * from final_df1").show 

and add the following (or equivalent) lines instead,

val query1 = personDF.writeStream.queryName("final_df1").format("memory").outputMode("append").start()
query1.awaitTermination()
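Note that awaitTermination() blocks the driver indefinitely; for a one-shot validation, query1.processAllAvailable() returns once the records currently in the topic have been drained. After that, the kafka data sits in the final_df1 in-memory table and the minus comparison becomes a plain batch query. A sketch, where oracle_table is a placeholder for whatever temp view the oracle DataFrame was registered under (not a name from the original post):

// Rows present in the Oracle source but missing from the Kafka target.
val missingInKafka = spark.sql(
  """SELECT * FROM oracle_table
    |EXCEPT
    |SELECT * FROM final_df1""".stripMargin)

// An empty result means every source row reached the topic.
missingInKafka.show(false)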
