
Outer join two Datasets (not DataFrames) in Spark Structured Streaming

I have some code that joins two streaming DataFrames and outputs to console.

import org.apache.spark.sql.functions.expr

// Watermark each input stream on its event-time column and alias it for the join condition.
val dataFrame1 =
  df1Input.withWatermark("timestamp", "40 seconds").as("A")

val dataFrame2 =
  df2Input.withWatermark("timestamp", "40 seconds").as("B")

// Stream-stream left outer join on the id key plus a one-hour time-range condition.
val finalDF: DataFrame = dataFrame1.join(dataFrame2,
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDF.writeStream.format("console").start().awaitTermination()

What I now want is to refactor this part to use Datasets, so I can have some compile-time checking.
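For context, the post never shows the case classes A and B. A minimal sketch consistent with the join condition used here (an id key plus an event-time timestamp column) might be:

// Hypothetical case classes; only id and timestamp are implied by the join condition.
case class A(id: String, timestamp: java.sql.Timestamp)
case class B(id: String, timestamp: java.sql.Timestamp)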

So what I tried was pretty straightforward:

// Same join, but via the typed joinWith API, which yields a Dataset of pairs.
val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDS.writeStream.format("console").start().awaitTermination()

However, this gives the following error:

org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;

As you can see, the join code hasn't changed, so there is a watermark on both sides and a range condition. The only change was to use the Dataset API instead of DataFrame.

Also, it is fine when I use an inner join:

val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"))

finalDS.writeStream.format("console").start().awaitTermination()

Does anyone know how this can happen?

Well, when you use the joinWith method instead of join, you rely on a different implementation, and it seems that this implementation does not support the leftOuter join for streaming Datasets.

You can check the outer joins with watermarking section of the official documentation. The join method is used there, not joinWith. Note that the result type will be DataFrame, which means you will most likely have to map the fields manually:

val finalDS = dataFrame1.as[A].join(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter").select(/* useful fields */).as[C]

If you are here to understand why this exception

org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;

still appears even though you have introduced watermarks and Spark 3 already supports stream-stream joins, you probably added the watermark AFTER the join, but Spark wants you to add the watermark BEFORE the join, on each stream!
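In other words, a minimal sketch of the ordering Spark expects, using the same inputs as the question:

// Correct: watermark each stream FIRST, then join.
val left  = df1Input.withWatermark("timestamp", "40 seconds").as("A")
val right = df2Input.withWatermark("timestamp", "40 seconds").as("B")

val joined = left.join(right,
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

// Incorrect: df1Input.join(df2Input, ...).withWatermark(...) -- the analyzer
// will still raise the AnalysisException above.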
