Running chained queries with Structured Streaming (PySpark)

Question

My code looks like this:

df = spark.readStream.option("header","true") \
    .schema(df_schema)\
    .csv(df_file)
df2 = df.filter(df.col == 1)
df3 = df2.withColumn("new_col", udf_f(df2.some_col))
dfc = df3.where(df3.new_col == 2).count()
query = dfc.writeStream.outputMode("append").format("console").start()
query.awaitTermination()

I get the error "Queries with streaming sources must be executed with writeStream.start()" at the dfc line, but I'm not sure what I'm doing wrong. Does Spark Structured Streaming not support chained queries like this? As far as I can tell, I'm not doing any branching.

Edit:

After removing count() from the dfc line, I get a new error, "StreamingQueryException: Exception thrown in awaitResult", raised by the query.awaitTermination() call. Any idea why count() doesn't work and why the new error appears?

Edit 2:

If I write df straight to the console without any of the intermediate queries after it, it works. But whenever I try to run additional queries, a StreamingQueryException is raised.

Answer 1

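Because of how Structured Streaming works, you cannot get a count the same way you would on a static DataFrame. When a stream is created, Spark polls the source for new data on a trigger. If there is any, Spark splits it into small DataFrames (micro-batches) and passes them along the stream (transformations, aggregations, output). On a DataFrame, count() is an action that tries to run the plan immediately and return a number, which is why the error complains that the query was not started through writeStream.start().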
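As a rough illustration of the streaming counterpart of count(), here is a minimal sketch that assumes the df3 DataFrame and new_col column from the question. Calling groupBy().count() yields another streaming DataFrame instead of a number, so it can be handed to writeStream. The function name start_running_count is hypothetical. An aggregation without a watermark is not allowed in append mode, so the sketch uses complete mode.

```python
def start_running_count(stream_df):
    """Start a query that prints a running count of rows where new_col == 2.

    stream_df is assumed to be a streaming DataFrame such as df3 in the question.
    """
    running_count = (
        stream_df
        .where("new_col = 2")  # SQL-string filter, equivalent to col("new_col") == 2
        .groupBy()             # global aggregation: no grouping columns
        .count()               # still a streaming DataFrame, not an int
    )
    return (
        running_count.writeStream
        .outputMode("complete")  # re-emit the full count after every trigger
        .format("console")
        .start()
    )
```

Usage would be query = start_running_count(df3) followed by query.awaitTermination().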

If you need the number of records, you can add a listener for progress updates and read the number of input rows in onQueryProgress(QueryProgressEvent event).
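In PySpark, one way to read those numbers without writing a listener is the progress information on the running query: query.lastProgress and query.recentProgress expose each update as a dictionary-like object with a numInputRows entry. The sketch below sums that field. The helper name total_input_rows is hypothetical. Note that numInputRows counts rows read from the source, before any filter, and that only a bounded window of recent updates is retained, so the sum covers recent micro-batches rather than the whole life of the query.

```python
def total_input_rows(progress_updates):
    """Sum numInputRows over progress updates (e.g. query.recentProgress)."""
    return sum(update["numInputRows"] for update in progress_updates)


# Usage against a running query (assumes `query` came from writeStream.start()):
# print(total_input_rows(query.recentProgress))
```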

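It is hard to say why you are getting the StreamingQueryException, because filter() and withColumn() work fine in Structured Streaming. Do you see any other errors in the console that could be causing "Exception thrown in awaitResult"?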
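Since that message is generic, the underlying cause often has to be read from the full exception text. A small sketch, assuming query is the object returned by start(); await_and_report is a hypothetical helper name. It catches a broad Exception so that it does not depend on where StreamingQueryException is importable from in a given PySpark version.

```python
def await_and_report(query):
    """Block on a streaming query; return the failure text, or None on a clean stop."""
    try:
        query.awaitTermination()
        return None
    except Exception as exc:  # a StreamingQueryException in practice
        # The full message may carry the underlying cause,
        # for example an error raised inside a UDF.
        message = str(exc)
        print(message)
        return message
```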

By the way, if you have multiple streams in one session, you should use spark.streams.awaitAnyTermination() to block until any one of them terminates.

The following query should work fine:

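from pyspark.sql.functions import col

# Columns are referenced with col() because no intermediate df/df2 exists in a single chain.
query = spark.readStream \
    .option("header", "true") \
    .schema(df_schema) \
    .csv(df_file) \
    .filter(col("col") == 1) \
    .withColumn("new_col", udf_f(col("some_col"))) \
    .writeStream \
    .format("console") \
    .outputMode("append") \
    .start()

query.awaitTermination()
# or spark.streams.awaitAnyTermination()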

Solution 1
1 2018-03-05 23:54:11