How to use fully formed SQL with Spark Structured Streaming
The documentation for Spark Structured Streaming says that, as of Spark 2.3, all methods available on the Spark context for static DataFrames/Datasets are also available for streaming DataFrames/Datasets. However, I have yet to run across any examples of this.
Using fully formed SQL is more flexible, expressive, and productive for me than the DSL. In addition, for my use case those SQL statements are already developed and well tested against static versions. There must be some rework, in particular using joins in place of correlated subqueries. However, there is still much value in retaining the overall full-bodied SQL structure.
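To illustrate the kind of rework involved (the table and column names here are hypothetical, not from my actual queries), a correlated subquery can typically be rewritten as a join against a pre-aggregated derived table:

```sql
-- Correlated subquery form: the inner query is logically
-- re-evaluated for each outer row.
SELECT o.orderId, o.productId
FROM orders o
WHERE o.amount > (SELECT avg(i.amount)
                  FROM orders i
                  WHERE i.productId = o.productId);

-- Equivalent join form: aggregate once, then join back.
SELECT o.orderId, o.productId
FROM orders o
JOIN (SELECT productId, avg(amount) AS avgAmount
      FROM orders
      GROUP BY productId) a
  ON o.productId = a.productId
WHERE o.amount > a.avgAmount;
```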
The format I am looking to use is like this hypothetical join:
val tabaDf = spark.readStream(..)
val tabbDf = spark.readStream(..)
val joinSql = """select a.*,
b.productName
from taba a
join tabb b
on a.productId = b.productId
where ..
group by ..
having ..
order by .."""
val joinedStreamingDf = spark.sql(joinSql)
There are a couple of items where it is not clear how to proceed:
Are tabaDf and tabbDf supposed to be defined via spark.readStream? This is my assumption.
How to declare taba and tabb. Trying to use

tabaDf.createOrReplaceTempView("taba")
tabbDf.createOrReplaceTempView("tabb")

results in:

WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
All of the examples I could find use the DSL and/or selectExpr(), like the following from https://databricks.com/blog/2017/04/26/processing-data-in-apache-kafka-with-structured-streaming-in-apache-spark-2-2.html
df.selectExpr("CAST(userId AS STRING) AS key", "to_json(struct(*)) AS value")
or use select:
sightingLoc
.groupBy("zip_code", window("start_time", "1 hour"))
.count()
.select(
to_json(struct("zip_code", "window")).alias("key"),
col("count").cast("string").alias("value"))
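For what it's worth, that DSL aggregation can usually be expressed in pure SQL once the stream is registered as a temporary view. This is a sketch, assuming sightingLoc has been registered as a view and has the columns named in the snippet above:

```sql
-- Hypothetical pure-SQL equivalent of the DSL aggregation above.
-- The window(...) grouping expression is repeated in the SELECT list,
-- which is valid since it matches the GROUP BY expression.
SELECT to_json(struct(zip_code, window(start_time, '1 hour'))) AS key,
       CAST(count(*) AS STRING) AS value
FROM sightingLoc
GROUP BY zip_code, window(start_time, '1 hour')
```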
Are those truly the only options, meaning the documentation's claim that all methods supported on static DataFrames/Datasets are also supported is not really accurate? Otherwise: any pointers on how to correct the above issue(s) and use straight-up SQL with streaming would be appreciated.
The streams need to be registered as temporary views using createOrReplaceTempView. AFAIK createOrReplaceView is not part of the Spark API (perhaps you have something that provides an implicit conversion to a class with such a method).
spark.readStream(..).createOrReplaceTempView("taba")
spark.readStream(..).createOrReplaceTempView("tabb")
Now the views can be accessed using pure SQL. For example, to print the output to the console:
spark
.sql(joinSql)
.writeStream
.format("console")
.start()
.awaitTermination()
Edit: after the question edit, I don't see anything wrong with your code. Here is a minimal working example. Assuming a test file /tmp/foo/foo.csv:
"a",1
"b",2
import org.apache.spark.sql.types._
val schema = StructType(Array(StructField("s", StringType), StructField("i", IntegerType)))
spark.readStream
.schema(schema)
.csv("/tmp/foo")
.createOrReplaceTempView("df1")
spark.readStream
.schema(schema)
.csv("/tmp/foo")
.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df1 JOIN df2 USING (s)")
.writeStream
.format("console")
.start()
.awaitTermination()
outputs
-------------------------------------------
Batch: 0
-------------------------------------------
+---+---+---+
| s| i| i|
+---+---+---+
| b| 2| 2|
| a| 1| 1|
+---+---+---+