
How to use fully formed SQL with Spark structured streaming

The documentation for Spark Structured Streaming says that, as of Spark 2.3, all methods on the Spark context available for static DataFrames / DataSets are also available for use with structured streaming DataFrames / DataSets. However, I have yet to run across any examples of this.

Using fully formed SQL is more flexible, expressive, and productive for me than the DSL. In addition, for my use case those SQL statements have already been developed and well tested against the static versions. There must be some rework - in particular, using joins in place of correlated subqueries. However, there is still much value in retaining the overall full-bodied SQL structure.
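As an illustration of that rework (a hypothetical sketch; the table and column names are made up), a correlated subquery such as

 select a.*
 from taba a
 where a.price > (select avg(b.price)
                  from tabb b
                  where b.productId = a.productId)

can be expressed instead as a join against a pre-aggregated derived table:

 select a.*
 from taba a
 join (select productId, avg(price) as avgPrice
       from tabb
       group by productId) b
   on a.productId = b.productId
 where a.price > b.avgPrice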

The format I am looking to use is like this hypothetical join:

 val tabaDf = spark.readStream(..)
 val tabbDf = spark.readStream(..)

 val joinSql = """select a.*, 
                  b.productName 
                  from taba a
                  join tabb b
                  on a.productId = b.productId
                  where ..
                  group by ..
                  having ..
                  order by .."""
 val joinedStreamingDf = spark.sql(joinSql)

There are a couple of items where it is not clear what to do:

  • Are tabaDf and tabbDf supposed to be defined via spark.readStream? This is my assumption.

  • How to declare taba and tabb. Trying to use

     tabaDf.createOrReplaceTempView("taba")
     tabbDf.createOrReplaceTempView("tabb")

    results in

    WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

All of the examples I could find use the DSL and/or selectExpr(), like the following: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html

 df.selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value")

or using select:

import org.apache.spark.sql.functions.{col, struct, to_json, window}

sightingLoc
  .groupBy(col("zip_code"), window(col("start_time"), "1 hour"))
  .count()
  .select(
    to_json(struct("zip_code", "window")).alias("key"),
    col("count").cast("string").alias("value"))

Are those truly the only options, meaning the documentation's claim that all methods supported on static DataFrames/Datasets also work with streaming is not really accurate? Otherwise, any pointers on how to correct the above issue(s) and use straight-up SQL with streaming would be appreciated.

The streams need to be registered as temporary views using createOrReplaceTempView. AFAIK createOrReplaceView is not a part of the Spark API (perhaps you have something that provides an implicit conversion to a class with such a method).

spark.readStream(..).createOrReplaceTempView("taba")
spark.readStream(..).createOrReplaceTempView("tabb")

Now the views can be accessed using pure SQL. For example, to print the output to the console:

spark
  .sql(joinSql)
  .writeStream
  .format("console")
  .start()
  .awaitTermination()
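A side note (general Structured Streaming behavior, shown with a hypothetical aggregation query for illustration): if the SQL contains an aggregation, the console sink needs a non-default output mode, since plain streaming aggregations are not supported in append mode:

spark
  .sql("SELECT productId, count(*) AS cnt FROM taba GROUP BY productId")
  .writeStream
  .outputMode("complete")  // or "update"; the default append mode rejects this aggregation
  .format("console")
  .start()
  .awaitTermination()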

Edit: After the question edit, I don't see anything wrong with your code. Here is a minimal working example, assuming a test file /tmp/foo/foo.csv with the contents:

"a",1
"b",2
import org.apache.spark.sql.types._
val schema = StructType(Array(StructField("s", StringType), StructField("i", IntegerType)))
spark.readStream
  .schema(schema)
  .csv("/tmp/foo")
  .createOrReplaceTempView("df1")
spark.readStream
  .schema(schema)
  .csv("/tmp/foo")
  .createOrReplaceTempView("df2")

spark.sql("SELECT * FROM df1 JOIN df2 USING (s)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()

outputs

-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+---+
|  s|  i|  i|
+---+---+---+
|  b|  2|  2|
|  a|  1|  1|
+---+---+---+
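By extension (a sketch over the same df1/df2 views; note the columns need qualifying because both views have an i column), a fuller SQL statement over the registered views runs the same way:

spark.sql("""
    SELECT df1.s, df1.i
    FROM df1 JOIN df2 ON df1.s = df2.s
    WHERE df1.i > 1""")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()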
