
How to use fully formed SQL with Spark structured streaming

The documentation for Spark Structured Streaming says that, as of Spark 2.3, all methods on the Spark context available for static DataFrames / DataSets are also available for use with structured streaming DataFrames / DataSets. However, I have yet to run across any examples of this.

Using fully formed SQL is more flexible, expressive, and productive for me than the DSL. In addition, for my use case those SQL statements have already been developed and well tested against the static versions. There must be some rework - in particular, using joins in place of correlated subqueries. However, there is still much value in retaining the overall full-bodied SQL structure.
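As an illustration of that rework (a hypothetical sketch; the table and column names are made up), a correlated subquery such as

 select a.*
 from taba a
 where a.price > (select avg(b.price)
                  from tabb b
                  where b.productId = a.productId)

can be expressed instead as a join against a pre-aggregated derived table:

 select a.*
 from taba a
 join (select productId, avg(price) as avgPrice
       from tabb
       group by productId) b
   on a.productId = b.productId
 where a.price > b.avgPrice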

The format I am looking to use is like this hypothetical join:

 val tabaDf = spark.readStream(..)
 val tabbDf = spark.readStream(..)

 val joinSql = """select a.*, 
                  b.productName 
                  from taba a
                  join tabb b
                  on a.productId = b.productId
                  where ..
                  group by ..
                  having ..
                  order by .."""
 val joinedStreamingDf = spark.sql(joinSql)

There are a couple of items where it is not clear what to do:

  • Are tabaDf and tabbDf supposed to be defined via spark.readStream? This is my assumption.

  • How to declare taba and tabb. Trying to use

     tabaDf.createOrReplaceTempView("taba")
     tabbDf.createOrReplaceTempView("tabb")

    results in

    WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException

All of the examples I could find use the DSL and/or selectExpr(), like the following: https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html

 df.selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value")

or using select:

import org.apache.spark.sql.functions.{col, struct, to_json, window}

sightingLoc
  .groupBy(col("zip_code"), window(col("start_time"), "1 hour"))
  .count()
  .select(
    to_json(struct("zip_code", "window")).alias("key"),
    col("count").cast("string").alias("value"))

Are those truly the only options, meaning the documentation's claim that all methods supported on static DataFrames/Datasets also work with streaming is not really accurate? Otherwise, any pointers on how to correct the above issue(s) and use straight-up SQL with streaming would be appreciated.

The streams need to be registered as temporary views using createOrReplaceTempView. AFAIK createOrReplaceView is not a part of the Spark API (perhaps you have something that provides an implicit conversion to a class with such a method).

spark.readStream(..).createOrReplaceTempView("taba")
spark.readStream(..).createOrReplaceTempView("tabb")

Now the views can be accessed using pure SQL. For example, to print the output to the console:

spark
  .sql(joinSql)
  .writeStream
  .format("console")
  .start()
  .awaitTermination()
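A side note (general Structured Streaming behavior, shown with a hypothetical aggregation query for illustration): if the SQL contains an aggregation, the console sink needs a non-default output mode, since plain streaming aggregations are not supported in append mode:

spark
  .sql("SELECT productId, count(*) AS cnt FROM taba GROUP BY productId")
  .writeStream
  .outputMode("complete")  // or "update"; the default append mode rejects this aggregation
  .format("console")
  .start()
  .awaitTermination()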

Edit: After the question edit, I don't see anything wrong with your code. Here is a minimal working example, assuming a test file /tmp/foo/foo.csv with the contents:

"a",1
"b",2
import org.apache.spark.sql.types._
val schema = StructType(Array(StructField("s", StringType), StructField("i", IntegerType)))
spark.readStream
  .schema(schema)
  .csv("/tmp/foo")
  .createOrReplaceTempView("df1")
spark.readStream
  .schema(schema)
  .csv("/tmp/foo")
  .createOrReplaceTempView("df2")

spark.sql("SELECT * FROM df1 JOIN df2 USING (s)")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()

outputs

-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+---+
|  s|  i|  i|
+---+---+---+
|  b|  2|  2|
|  a|  1|  1|
+---+---+---+
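By extension (a sketch over the same df1/df2 views; note the columns need qualifying because both views have an i column), a fuller SQL statement over the registered views runs the same way:

spark.sql("""
    SELECT df1.s, df1.i
    FROM df1 JOIN df2 ON df1.s = df2.s
    WHERE df1.i > 1""")
  .writeStream
  .format("console")
  .start()
  .awaitTermination()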
