
How to do a time-based as-of join of two datasets in Apache Spark?

Given two datasets, S and R, both with a time column (t), as described below:

//snapshot with id at t
case class S(id: String, t: Int)

//reference data at t
case class R(t: Int, fk: String)

//Example test case
val ss: Dataset[S] = Seq(S("a", 1), S("a", 3), S("b", 5), S("b", 7)).toDS

val rs: Dataset[R] = Seq(R(0, "a"), R(2, "a"), R(6, "b")).toDS

val srs: Dataset[(S, Option[R])] = ss.asOfJoin(rs)

srs.collect() must contain theSameElementsAs Seq(
  (S("a", 1), Some(R(0, "a"))),
  (S("a", 3), Some(R(2, "a"))),
  (S("b", 5), None),
  (S("b", 7), Some(R(6, "b"))))

The goal is to find, for each row of S, the most recent row in R that matches S's id, if possible, i.e. R can be optional in the output.

asOfJoin is defined as follows:

  implicit class SOps(ss: Dataset[S]) {
    def asOfJoin(rs: Dataset[R])(implicit spark: SparkSession): Dataset[(S, Option[R])] = ???
  }

One solution using the Dataset API is as follows:

def asOfJoin(rs: Dataset[R])(implicit spark: SparkSession): Dataset[(S, Option[R])] = {
  import spark.implicits._

  ss
    .joinWith(
      rs,
      ss("id") === rs("fk") && ss("t") >= rs("t"),
      "left_outer")
    .map { case (l, r) => (l, Option(r)) } // null right side becomes None
    .groupByKey { case (s, _) => s }
    .reduceGroups { (x, y) =>
      // keep the candidate with the larger reference time
      (x, y) match {
        case ((_, Some(R(tx, _))), (_, Some(R(ty, _)))) => if (tx > ty) x else y
        case _ => x
      }
    }
    .map { case (_, r) => r }
}

I'm not sure about the size of dataset S and dataset R, but from your code I can see that the efficiency of the join (with the inequality expression) is poor. I can give some suggestions based on different scenarios:

Either dataset R or dataset S does not have much data.

I suggest broadcasting the smaller dataset and finishing the business logic in a Spark UDF with the help of the broadcast variable. That way you avoid the shuffle of the join entirely, which saves a lot of time and resources.
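
A minimal sketch of that idea, reusing the S and R case classes from the question. asOfJoinBroadcast is an illustrative helper name; it assumes the reference data fits in driver memory, and it resolves the lookup with a typed map over the broadcast value rather than a literal SQL UDF:

import org.apache.spark.sql.{Dataset, SparkSession}

def asOfJoinBroadcast(ss: Dataset[S], rs: Dataset[R])(
    implicit spark: SparkSession): Dataset[(S, Option[R])] = {
  import spark.implicits._

  // id -> reference rows sorted by time, collected on the driver and broadcast
  val byId: Map[String, Seq[R]] =
    rs.collect().toSeq.groupBy(_.fk).map { case (k, v) => k -> v.sortBy(_.t) }
  val bc = spark.sparkContext.broadcast(byId)

  ss.map { s =>
    // most recent reference row at or before s.t, if any
    val latest = bc.value.getOrElse(s.id, Seq.empty).takeWhile(_.t <= s.t).lastOption
    (s, latest)
  }
}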

For every unique id, count(distinct t) is not large.

I suggest doing a pre-aggregation by grouping on id and collecting the set of times, like this:

select id, collect_set(t) as t_set from S group by id

This way, you can remove the inequality expression (ss("t") >= rs("t")) from the join, and write your business logic over the two t_set columns from dataset S and dataset R.
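
A rough sketch of this in the Dataset API, assuming a SparkSession named spark and the ss/rs datasets from the question are in scope; the t_set column name comes from the SQL above, while the variable names and the flatMap that rebuilds (S, Option[R]) pairs are illustrative:

import org.apache.spark.sql.functions.collect_set
import spark.implicits._

// One row per id with the set of its times, for both sides
val sAgg: Dataset[(String, Seq[Int])] = ss
  .groupBy($"id").agg(collect_set($"t").as("t_set"))
  .select($"id".as[String], $"t_set".as[Seq[Int]])

val rAgg: Dataset[(String, Seq[Int])] = rs
  .groupBy($"fk").agg(collect_set($"t").as("t_set"))
  .select($"fk".as[String], $"t_set".as[Seq[Int]])

// Equi-join on the key only; the time logic runs over the two arrays
val result: Dataset[(S, Option[R])] = sAgg
  .joinWith(rAgg, sAgg("id") === rAgg("fk"), "left_outer")
  .flatMap { case ((id, sTs), rRow) =>
    // rRow is null when the id has no reference rows at all
    val rTs = Option(rRow).map(_._2.sorted).getOrElse(Seq.empty[Int])
    sTs.map { st =>
      val rt = rTs.takeWhile(_ <= st).lastOption // most recent t <= st
      (S(id, st), rt.map(R(_, id)))
    }
  }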

For other scenarios:

I suggest optimizing your code with an equi-join and a window function. Since I'm more familiar with SQL, I'll write SQL here, which can be transformed to the Dataset API:

select
  sid,
  st,
  rt
from
(
  select
    S.id as sid,
    S.t as st,
    R.t as rt,
    row_number() over (partition by S.id, S.t order by (S.t - NVL(R.t, 0))) rn
  from
    S
    left join R on S.id = R.fk
) tbl
where tbl.rn = 1
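
For reference, here is one way to express the equi-join plus window-function idea in the DataFrame API. This sketch partitions by both the id and the snapshot time so every row of S keeps its own best match, and it orders the window so that only reference rows at or before the snapshot time are eligible, which is slightly stricter than the SQL above; the column aliases (sid, st, rt) are illustrative and the result is an untyped DataFrame rather than Dataset[(S, Option[R])]:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number, when}

// Equi-join on the key only; the time condition is handled in the window
val joined = ss.toDF("sid", "st")
  .join(rs.toDF("rt", "fk"), col("sid") === col("fk"), "left_outer")

// Rank eligible reference rows (rt <= st) by recency; ineligible or missing
// reference rows sort last
val w = Window
  .partitionBy(col("sid"), col("st"))
  .orderBy(when(col("rt") <= col("st"), -col("rt")).otherwise(Int.MaxValue))

val asOfDf = joined
  .withColumn("rn", row_number().over(w))
  .where(col("rn") === 1)
  .select(col("sid"), col("st"), when(col("rt") <= col("st"), col("rt")).as("rt"))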

I took @bupt_ljy's comment about avoiding a theta join, and the following seems to scale really well:

def asOfJoin(rs: Dataset[R])(implicit spark: SparkSession): Dataset[(S, Option[R])] = {
  import spark.implicits._

  ss
    .joinWith(
      rs,
      ss("id") === rs("fk"), // equality only, no theta condition
      "left_outer")
    .map { case (l, r) => (l, Option(r)) }
    .groupByKey { case (s, _) => s }
    .flatMapGroups { (k, vs) =>
      // Scan the group once and keep the most recent reference row with
      // r.t <= k.t; every row in the group shares the same S value k.
      val latest = vs.foldLeft(Option.empty[R]) {
        case (best, (_, Some(r))) if k.t >= r.t && best.forall(_.t < r.t) => Some(r)
        case (best, _) => best
      }
      Iterator.single((k, latest))
    }
}

However, this is still fairly imperative code and there must be a better way...
