
Data read from Kafka into Spark disappears after registration as a table?

Consider data written from a dataframe to Kafka and then read from Kafka back out into a new dataframe:

// Write from df to kafka
val wdf  = airj.write
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("topic", "air2008")
  .save
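
Note that the Kafka sink expects the dataframe to carry a string or binary value column (and optionally a key), so airj must already be in that shape. A minimal sketch of such a prelude, assuming a hypothetical JSON source file:

// Hypothetical prelude: serialize each row into a single JSON string
// column named "value", which the Kafka sink requires.
import org.apache.spark.sql.functions._
val airj = spark.read
  .json("/data/air2008.json") // hypothetical path
  .select(to_json(struct(col("*"))).alias("value"))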

Now read the data back:

// Read from kafka into spark df
import org.apache.spark.sql.functions._
val flights = (spark.read
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "air2008")
  .load())
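
Each record comes back with the Kafka source's fixed schema (key, value, topic, partition, offset, timestamp, timestampType), where key and value are binary. To inspect the payload you would typically cast them, e.g. as a sketch:

// Decode the binary key/value columns into strings for inspection.
val decoded = flights.selectExpr(
  "CAST(key AS STRING)",
  "CAST(value AS STRING)")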

How many records?

scala> flights.count
res36: Long = 5824436

Let's register this as a table:

flights.createOrReplaceTempView("flights_raw")

Let's ask that a different way: how many records?!

spark.sql("select count(1) from flights_raw").show
+--------+
|count(1)|
+--------+
|       0|
+--------+

Let's ask the question the first way again:

scala> flights.count
res40: Long = 0

What happened here?

Based on a comment from @GiorgosMyrianthous I put a _cache_ in. It only helps if done before the createOrReplaceTempView, as follows.

Does not work:

import org.apache.spark.sql.functions._
val flights = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "air2008")
  .load()
flights.createOrReplaceTempView("flights_raw").cache
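
(This variant does not even compile: createOrReplaceTempView returns Unit, so the trailing cache call has no DataFrame to attach to, and nothing is ever persisted.)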

Works:

import org.apache.spark.sql.functions._
val flights = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "air2008")
  .load()
flights.cache
flights.createOrReplaceTempView("flights_raw")

Now it works:

scala> flights.count
res47: Long = 5824436

createOrReplaceTempView is lazily evaluated, meaning that it does not persist anything to memory. To do so, you'd have to cache the data.

flights.cache
flights.createOrReplaceTempView("flights_raw")

or

flights.createOrReplaceTempView("flights_raw")
spark.table("flights_raw")
spark.table("flights_raw").cache
spark.table("flights_raw").count

should do the trick.
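
As a sketch of an equivalent alternative, the registered view can also be pinned through the catalog and then materialized with an action:

// Cache the temp view via the catalog, then run an action so the
// cache is actually populated (caching itself is lazy).
flights.createOrReplaceTempView("flights_raw")
spark.catalog.cacheTable("flights_raw")
spark.table("flights_raw").count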
