
How to join two streaming datasets when one dataset involves aggregation

I am getting the error below in the following code snippet:

Exception in thread "main" org.apache.spark.sql.AnalysisException: Append output mode not supported when there are streaming aggregations on streaming DataFrames/DataSets without watermark;;

Below is my input schema:

import org.apache.spark.sql.types.{StructType, StringType, IntegerType, TimestampType}

val schema = new StructType()
  .add("product", StringType)
  .add("org", StringType)
  .add("quantity", IntegerType)
  .add("booked_at", TimestampType)

Creating the streaming source dataset:

import org.apache.spark.sql.functions.{from_json, col}

val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")

Creating another streaming dataframe where the aggregation is done, and then joining it with the original source dataframe to filter out records:

payload_df.createOrReplaceTempView("orders")

val stage_df = spark.sql(
  "select org, product, max(booked_at) as booked_at from orders group by 1, 2")
stage_df.createOrReplaceTempView("stage")

val total_qty = spark.sql(
  "select o.* from orders o join stage s on o.org = s.org and o.product = s.product and o.booked_at > s.booked_at")

Finally, I was trying to display the results on the console with the Append output mode. I am not able to figure out where I need to add a watermark, or how to resolve this. My objective is to filter out, in every trigger, only those events whose timestamp is higher than the maximum timestamp received in any of the earlier triggers.

total_qty
    .writeStream
    .format("console")
    .outputMode("append")
    .start()
    .awaitTermination()

With Spark Structured Streaming you can only aggregate directly on a stream with a watermark. If you have a column with the timestamp of the event, you can do it like this:

import org.apache.spark.sql.functions.{from_json, col}

val payload_df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test1")
  .option("startingOffsets", "latest")
  .load()
  .selectExpr("CAST(value AS STRING) as payload")
  .select(from_json(col("payload"), schema).as("data"))
  .select("data.*")
  .withWatermark("booked_at", "1 minute")  // booked_at is the event-time column in your schema

On queries with aggregation you have 3 types of output modes:

  • Append mode uses the watermark to drop old aggregation state. But the output of a windowed aggregation is delayed by the late threshold specified in withWatermark(), since by the mode's semantics rows can be added to the Result Table only once, after they are finalized (i.e. after the watermark is crossed). See the Late Data section for more details.

  • Update mode uses the watermark to drop old aggregation state (see the sketch after this list).

  • Complete mode does not drop old aggregation state, since by definition this mode preserves all data in the Result Table.
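
For illustration (not part of the original answer), a minimal sketch assuming the payload_df defined above: the same per-(org, product) aggregation written in update mode. Update mode re-emits a group's row on every trigger in which its aggregate changed, so it does not have to wait for the watermark the way append mode does:

import spark.implicits._
import org.apache.spark.sql.functions.max

// Running max of booked_at per (org, product); no window needed in update mode.
val runningMax = payload_df
  .groupBy($"org", $"product")
  .agg(max($"booked_at").as("booked_at"))

// Changed aggregates are re-emitted on every trigger.
runningMax
  .writeStream
  .format("console")
  .outputMode("update")
  .start()
  .awaitTermination()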

Edit: You have to add a window to your groupBy method (using the booked_at event-time column from the schema above):

import spark.implicits._
import org.apache.spark.sql.functions.{window, max}

val aggFg = payload_df
  .groupBy(window($"booked_at", "1 minute"), $"org", $"product")
  .agg(max($"booked_at").as("booked_at"))
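
With the watermark and the window in place, the aggregated stream can then be written in append mode; a minimal sketch (assuming the aggFg above), where each window's row appears only once, after the watermark passes the end of that window:

// Append mode: a window is emitted once it is finalized by the watermark.
aggFg
  .writeStream
  .format("console")
  .outputMode("append")
  .start()
  .awaitTermination()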
