Perform a typed join in Scala with Spark Datasets

I like Spark Datasets as they give me analysis errors and syntax errors at compile time and also allow me to work with getters instead of hard-coded names/numbers. Most computations can be accomplished with Dataset's high-level APIs. For example, it's much simpler to perform agg, select, sum, avg, map, filter, or groupBy operations by accessing a Dataset's typed objects than by using the data fields of RDD rows.
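
A minimal sketch of the kind of typed access I mean (Person is a hypothetical case class, assuming the Spark implicits are in scope):

case class Person(name: String, age: Int)

val people = Seq(Person("alice", 30), Person("bob", 15)).toDS()

// Fields are accessed through the case class getters and checked at compile
// time, instead of through hard-coded column names or row indices.
val adultNames = people.filter(_.age >= 18).map(_.name)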

However, the join operation is missing from this. I read that I can do a join like this:

ds1.joinWith(ds2, ds1.toDF().col("key") === ds2.toDF().col("key"), "inner")

But that is not what I want, as I would prefer to do it via the case class interface, so something more like this:

ds1.joinWith(ds2, ds1.key === ds2.key, "inner")

The best alternative for now seems to be creating an object next to the case class and giving it functions that provide the right column name as a String. So I would use the first line of code but put a function instead of a hard-coded column name. But that doesn't feel elegant enough.
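
Roughly, the workaround I mean looks like this (Record and its fields are just placeholder names):

case class Record(key: String, value: Double)

object Record {
  // Centralised column names so call sites don't hard-code strings.
  val key = "key"
  val value = "value"
}

// ds1.joinWith(ds2, ds1.toDF().col(Record.key) === ds2.toDF().col(Record.key), "inner")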

Can someone advise me on other options here? The goal is to have an abstraction from the actual column names and to work preferably via the getters of the case class.

I'm using Spark 1.6.1 and Scala 2.10.

Observation

Spark SQL can optimize a join only if the join condition is based on the equality operator. This means we can consider equijoins and non-equijoins separately.

Equijoin

An equijoin can be implemented in a type-safe manner by mapping both Datasets to (key, value) tuples, performing the join based on the keys, and reshaping the result:

import org.apache.spark.sql.Encoder
import org.apache.spark.sql.Dataset

def safeEquiJoin[T, U, K](ds1: Dataset[T], ds2: Dataset[U])
    (f: T => K, g: U => K)
    (implicit e1: Encoder[(K, T)], e2: Encoder[(K, U)], e3: Encoder[(T, U)]) = {
  // Pair each record with the key extracted by the typed functions f and g.
  val ds1_ = ds1.map(x => (f(x), x))
  val ds2_ = ds2.map(x => (g(x), x))
  // Join on the key column (_1) and drop the keys from the output tuples.
  ds1_.joinWith(ds2_, ds1_("_1") === ds2_("_1")).map(x => (x._1._2, x._2._2))
}

Non-equijoin

A non-equijoin can be expressed using relational algebra operators as R ⋈θ S = σθ(R × S) and converted directly to code.

Spark 2.0

Enable cross joins and use joinWith with a trivially true predicate:

import org.apache.spark.sql.functions.lit

spark.conf.set("spark.sql.crossJoin.enabled", true)

def safeNonEquiJoin[T, U](ds1: Dataset[T], ds2: Dataset[U])
                         (p: (T, U) => Boolean) = {
  // Cross join with an always-true condition, then apply the typed predicate.
  ds1.joinWith(ds2, lit(true)).filter(p.tupled)
}

Spark 2.1

Use the crossJoin method:

def safeNonEquiJoin[T, U](ds1: Dataset[T], ds2: Dataset[U])
    (p: (T, U) => Boolean)
    (implicit e1: Encoder[Tuple1[T]], e2: Encoder[Tuple1[U]], e3: Encoder[(T, U)]) = {
  // Wrap each row in Tuple1, cross join, read the result back as (T, U) pairs,
  // and apply the typed predicate.
  ds1.map(Tuple1(_)).crossJoin(ds2.map(Tuple1(_))).as[(T, U)].filter(p.tupled)
}

Examples

case class LabeledPoint(label: String, x: Double, y: Double)
case class Category(id: Long, name: String)

val points1 = Seq(LabeledPoint("foo", 1.0, 2.0)).toDS
val points2 = Seq(
  LabeledPoint("bar", 3.0, 5.6), LabeledPoint("foo", -1.0, 3.0)
).toDS
val categories = Seq(Category(1, "foo"), Category(2, "bar")).toDS

safeEquiJoin(points1, categories)(_.label, _.name)
safeNonEquiJoin(points1, points2)(_.x > _.x)
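
For reference, with the sample data above the two calls should produce results along these lines (a hedged sketch; row order is not guaranteed):

safeEquiJoin(points1, categories)(_.label, _.name).collect()
// expected to contain: (LabeledPoint("foo", 1.0, 2.0), Category(1, "foo"))

safeNonEquiJoin(points1, points2)(_.x > _.x).collect()
// expected to contain: (LabeledPoint("foo", 1.0, 2.0), LabeledPoint("foo", -1.0, 3.0))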

Notes

  • It should be noted that these methods are qualitatively different from a direct joinWith application and require expensive DeserializeToObject / SerializeFromObject transformations (whereas a direct joinWith can use logical operations on the serialized data).

    This is similar to the behavior described in Spark 2.0 Dataset vs DataFrame.

  • If you're not limited to the Spark SQL API, frameless provides interesting type-safe extensions for Datasets (as of today it supports only Spark 2.0):

     import frameless.TypedDataset

     val typedPoints1 = TypedDataset.create(points1)
     val typedPoints2 = TypedDataset.create(points2)

     typedPoints1.join(typedPoints2, typedPoints1('x), typedPoints2('x))
  • The Dataset API is not stable in 1.6, so I don't think it makes sense to use it there.

  • Of course, this design and these descriptive names are not necessary. You can easily use a type class to add these methods implicitly to Dataset, and there is no conflict with the built-in signatures, so both can be called joinWith.
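
    A minimal sketch of that idea (the object and names below are hypothetical; it simply delegates to the safeEquiJoin defined above):

     import org.apache.spark.sql.{Dataset, Encoder}

     object TypedJoinSyntax {
       implicit class TypedJoinOps[T](val ds: Dataset[T]) {
         // The built-in joinWith overloads take a Column condition, so this
         // extension method with key-extractor functions does not clash with them.
         def joinWith[U, K](other: Dataset[U])(f: T => K, g: U => K)(
             implicit e1: Encoder[(K, T)], e2: Encoder[(K, U)], e3: Encoder[(T, U)]): Dataset[(T, U)] =
           safeEquiJoin(ds, other)(f, g)
       }
     }

     // import TypedJoinSyntax._
     // points1.joinWith(categories)(_.label, _.name)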

Also, another bigger problem with the non-type-safe Spark API is that when you join two Datasets, it will give you a DataFrame, and then you lose the types from your original two datasets.

val a: Dataset[A]
val b: Dataset[B]

val joined: DataFrame = a.join(b)
// what would be great is 
val joined: Dataset[C] = a.join(b)(implicit func: (A, B) => C)
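
A hedged sketch of what you can do today instead, assuming Spark 2.x: joinWith already preserves both element types as a tuple, and you can map that tuple into a combined type yourself (A, B, C, and the join condition below are placeholders):

import org.apache.spark.sql.Dataset

case class A(key: String, left: Int)
case class B(key: String, right: Int)
case class C(key: String, left: Int, right: Int)

def joinToC(a: Dataset[A], b: Dataset[B]): Dataset[C] = {
  import a.sparkSession.implicits._
  // joinWith keeps both element types as (A, B); map builds C explicitly.
  a.joinWith(b, a("key") === b("key"), "inner")
    .map { case (x, y) => C(x.key, x.left, y.right) }
}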
