
Outer join two Datasets (not DataFrames) in Spark Structured Streaming

I have some code that joins two streaming DataFrames and outputs to console.

import org.apache.spark.sql.functions.expr

// Watermark each input stream on its event-time column and alias it for the join condition.
val dataFrame1 =
  df1Input.withWatermark("timestamp", "40 seconds").as("A")

val dataFrame2 =
  df2Input.withWatermark("timestamp", "40 seconds").as("B")

// Stream-stream left outer join on the id key plus a one-hour time-range condition.
val finalDF: DataFrame = dataFrame1.join(dataFrame2,
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDF.writeStream.format("console").start().awaitTermination()

What I now want is to refactor this part to use Datasets, so I can have some compile-time checking.
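For context, the post never shows the case classes A and B. A minimal sketch consistent with the join condition used here (an id key plus an event-time timestamp column) might be:

// Hypothetical case classes; only id and timestamp are implied by the join condition.
case class A(id: String, timestamp: java.sql.Timestamp)
case class B(id: String, timestamp: java.sql.Timestamp)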

So what I tried was pretty straightforward:

// Same join, but via the typed joinWith API, which yields a Dataset of pairs.
val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

finalDS.writeStream.format("console").start().awaitTermination()

However, this gives the following error:

org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;

As you can see, the join code hasn't changed, so there is a watermark on both sides and a range condition. The only change was to use the Dataset API instead of DataFrame.

Also, it is fine when I use an inner join:

val finalDS: Dataset[(A, B)] = dataFrame1.as[A].joinWith(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"))

finalDS.writeStream.format("console").start().awaitTermination()

Does anyone know how this can happen?

Well, when you use the joinWith method instead of join, you rely on a different implementation, and it seems that this implementation does not support the leftOuter join for streaming Datasets.

You can check the outer joins with watermarking section of the official documentation. The join method is used there, not joinWith. Note that the result type will be DataFrame, which means you will most likely have to map the fields manually:

val finalDS = dataFrame1.as[A].join(dataFrame2.as[B],
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter").select(/* useful fields */).as[C]

If you are here to understand why this exception

org.apache.spark.sql.AnalysisException: Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition;;

still appears even though you have introduced watermarks and Spark 3 already supports stream-stream joins, you probably added the watermark AFTER the join, but Spark wants you to add the watermark BEFORE the join, on each stream!
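In other words, a minimal sketch of the ordering Spark expects, using the same inputs as the question:

// Correct: watermark each stream FIRST, then join.
val left  = df1Input.withWatermark("timestamp", "40 seconds").as("A")
val right = df2Input.withWatermark("timestamp", "40 seconds").as("B")

val joined = left.join(right,
  expr(
    "A.id = B.id" +
      " AND " +
      "B.timestamp >= A.timestamp" +
      " AND " +
      "B.timestamp <= A.timestamp + interval 1 hour"),
  joinType = "leftOuter")

// Incorrect: df1Input.join(df2Input, ...).withWatermark(...) -- the analyzer
// will still raise the AnalysisException above.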
