Data read from kafka into spark disappears after registration as a table?
Consider data written from a dataframe to kafka, and then read from kafka back out into a new dataframe:
// Write from df to kafka
val wdf = airj.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "air2008")
  .save()
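As an aside, the Kafka sink expects the dataframe to expose a string or binary "value" column (and optionally a "key"). The original post does not show how airj was shaped; a minimal sketch of packing arbitrary columns into JSON, using toy data and hypothetical column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().master("local[*]").appName("prep").getOrCreate()
import spark.implicits._

// Toy stand-in for airj (hypothetical columns, not from the original post)
val airj = Seq((2008, "SFO", "JFK")).toDF("year", "origin", "dest")

// Pack every column into one JSON string named "value" -- the shape the
// Kafka sink accepts (a string/binary "value", plus an optional "key").
val kafkaReady = airj.select(to_json(struct(airj.columns.map(col): _*)).as("value"))
```

A dataframe shaped like kafkaReady can then be written with the same `.write.format("kafka")` call shown above.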
Now read the data back:
// Read from kafka into spark df
import org.apache.spark.sql.functions._
val flights = (spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "air2008")
.load())
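Note that the Kafka source returns key and value as binary, alongside metadata columns (topic, partition, offset, timestamp). To actually inspect the payload, the standard pattern is to cast to STRING; a sketch, continuing from the flights dataframe above:

```scala
// key/value arrive as binary from the Kafka source; cast them to
// STRING to make the records human-readable, keeping some metadata.
val readable = flights.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "topic", "partition", "offset")
readable.show(5, false)
```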
How many records?
scala> flights.count
res36: Long = 5824436
Let's register this as a table:
flights.createOrReplaceTempView("flights_raw")
Let's ask that a different way: how many records now??!
spark.sql("select count(1) from flights_raw").show
+--------+
|count(1)|
+--------+
|       0|
+--------+
Let's ask the question the first way again:
scala> flights.count
res40: Long = 0
What happened here?
Based on a comment from @GiorgosMyrianthous I put a _cache_ in. It only helps if done before the createOrReplaceTempView, as follows.
Does not work (createOrReplaceTempView returns Unit, so there is nothing left to cache):
import org.apache.spark.sql.functions._
val flights = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "air2008")
.load()
flights.createOrReplaceTempView("flights_raw").cache
Works:
import org.apache.spark.sql.functions._
val flights = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "air2008")
.load()
flights.cache
flights.createOrReplaceTempView("flights_raw")
Now it works:
scala> flights.count
res47: Long = 5824436
createOrReplaceTempView is lazily evaluated, meaning that it does not persist the data to memory. To do so, you'd have to cache the data.
flights.cache
flights.createOrReplaceTempView("flights_raw")
or
flights.createOrReplaceTempView("flights_raw")
spark.table("flights_raw")
spark.table("flights_raw").cache
spark.table("flights_raw").count
should do the trick.
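The same cache-before-action behavior can be demonstrated without Kafka at all. A minimal sketch with an in-memory dataframe (toy data and names assumed, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cacheDemo").getOrCreate()
import spark.implicits._

// Toy dataframe standing in for the Kafka-backed one
val df = Seq(("AA", 1), ("UA", 2), ("DL", 3)).toDF("carrier", "flights")

// cache is itself lazy: it only marks the dataframe for caching.
df.cache()
df.createOrReplaceTempView("carriers")

// The first action materializes the cache; later reads of the view
// then hit the cached data instead of re-evaluating the source.
val n = spark.table("carriers").count()
```

Note that `df.createOrReplaceTempView(...).cache` would not even compile, since createOrReplaceTempView returns Unit, which is why the ordering above matters.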