Data read from kafka into spark disappears after registration as a table?
Consider data written from a dataframe to kafka, and then read from kafka back out into a new dataframe:
// Write from df to kafka
val wdf = airj.write
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("topic", "air2008")
  .save()
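As an aside, the Kafka sink expects the dataframe to expose a string or binary "value" column (and optionally a "key"). The original post does not show how airj was shaped; a minimal sketch of packing arbitrary columns into JSON, using toy data and hypothetical column names:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

val spark = SparkSession.builder().master("local[*]").appName("prep").getOrCreate()
import spark.implicits._

// Toy stand-in for airj (hypothetical columns, not from the original post)
val airj = Seq((2008, "SFO", "JFK")).toDF("year", "origin", "dest")

// Pack every column into one JSON string named "value" -- the shape the
// Kafka sink accepts (a string/binary "value", plus an optional "key").
val kafkaReady = airj.select(to_json(struct(airj.columns.map(col): _*)).as("value"))
```

A dataframe shaped like kafkaReady can then be written with the same `.write.format("kafka")` call shown above.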
Now read the data back:
// Read from kafka into spark df
import org.apache.spark.sql.functions._
val flights = (spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "air2008")
.load())
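Note that the Kafka source returns key and value as binary, alongside metadata columns (topic, partition, offset, timestamp). To actually inspect the payload, the standard pattern is to cast to STRING; a sketch, continuing from the flights dataframe above:

```scala
// key/value arrive as binary from the Kafka source; cast them to
// STRING to make the records human-readable, keeping some metadata.
val readable = flights.selectExpr(
  "CAST(key AS STRING) AS key",
  "CAST(value AS STRING) AS value",
  "topic", "partition", "offset")
readable.show(5, false)
```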
How many records?
scala> flights.count
res36: Long = 5824436
Let's register this as a table:
flights.createOrReplaceTempView("flights_raw")
Let's ask that a different way: how many records now??!
spark.sql("select count(1) from flights_raw").show
+--------+
|count(1)|
+--------+
|       0|
+--------+
Let's ask the question the first way again:
scala> flights.count
res40: Long = 0
What happened here?
Based on a comment from @GiorgosMyrianthous I put a _cache_ in. It only helps if done before the createOrReplaceTempView, as follows.
Does not work (createOrReplaceTempView returns Unit, so there is nothing left to cache):
import org.apache.spark.sql.functions._
val flights = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "air2008")
.load()
flights.createOrReplaceTempView("flights_raw").cache
Works:
import org.apache.spark.sql.functions._
val flights = spark.read
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "air2008")
.load()
flights.cache
flights.createOrReplaceTempView("flights_raw")
Now it works:
scala> flights.count
res47: Long = 5824436
createOrReplaceTempView is lazily evaluated, meaning that it does not persist the data to memory. To do so, you'd have to cache the data.
flights.cache
flights.createOrReplaceTempView("flights_raw")
or
flights.createOrReplaceTempView("flights_raw")
spark.table("flights_raw")
spark.table("flights_raw").cache
spark.table("flights_raw").count
should do the trick.
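The same cache-before-action behavior can be demonstrated without Kafka at all. A minimal sketch with an in-memory dataframe (toy data and names assumed, not from the original post):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("cacheDemo").getOrCreate()
import spark.implicits._

// Toy dataframe standing in for the Kafka-backed one
val df = Seq(("AA", 1), ("UA", 2), ("DL", 3)).toDF("carrier", "flights")

// cache is itself lazy: it only marks the dataframe for caching.
df.cache()
df.createOrReplaceTempView("carriers")

// The first action materializes the cache; later reads of the view
// then hit the cached data instead of re-evaluating the source.
val n = spark.table("carriers").count()
```

Note that `df.createOrReplaceTempView(...).cache` would not even compile, since createOrReplaceTempView returns Unit, which is why the ordering above matters.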