How do I use joinWith to join more than 2 datasets?

I want to achieve something like this:

x.joinWith(y, x(id) === y(fid), "left_outer")
  .joinWith(z, x(id) === z(fid))
  .map { case (x, y, z) => combineXYZ(x, y, z) }

When you use joinWith, what you get back is a new Dataset of Tuple2: (x, y). Its column names are therefore _1 and _2.

So when you do your second join, you need to reference a column of the tuple, not a column of one of the source datasets. Like this:

x.joinWith(y, x(id) === y(fid), "left_outer").joinWith(z, $"_1.id" === z(fid))

Now what you get is a Tuple2 whose first element is itself a tuple: ((x, y), z). So your map must match that nested shape:

.map { case ((x, y), z) => combineXYZ(x, y, z) }
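Putting the two joins and the map together, a minimal sketch might look like the following. The case classes X, Y, Z and Combined, their fields, and the body of combineXYZ are all hypothetical and only there for illustration; the sketch also assumes that id and fid in the question are column names, and that the unmatched side of the left outer join comes back as null.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types; the question does not show the real ones.
case class X(id: Long, name: String)
case class Y(fid: Long, value: String)
case class Z(fid: Long, flag: Boolean)
case class Combined(id: Long, name: String, value: Option[String], flag: Boolean)

object JoinWithThree {
  // Hypothetical stand-in for combineXYZ. `y` is assumed to be null when
  // the left outer join finds no match, so it is wrapped in Option.
  def combineXYZ(x: X, y: Y, z: Z): Combined =
    Combined(x.id, x.name, Option(y).map(_.value), z.flag)

  def joinThree(spark: SparkSession,
                x: Dataset[X],
                y: Dataset[Y],
                z: Dataset[Z]): Dataset[Combined] = {
    import spark.implicits._

    // First join: the result has columns _1 (an X) and _2 (a Y).
    val xy: Dataset[(X, Y)] = x.joinWith(y, x("id") === y("fid"), "left_outer")

    // Second join: the left key is now the nested column _1.id.
    val xyz: Dataset[((X, Y), Z)] = xy.joinWith(z, $"_1.id" === z("fid"))

    // The pattern mirrors the nested tuple shape ((x, y), z).
    xyz.map { case ((xr, yr), zr) => combineXYZ(xr, yr, zr) }
  }
}
```

The pattern variables are named xr, yr and zr only to avoid shadowing the datasets x, y and z.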

This should work. Note that if you don't want to use $"_1.id", which is totally understandable, you can do a map right after your first join to build a new object that is not a Tuple2, so that the result has proper column names again.
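The map-after-the-first-join alternative could be sketched as below. XRec, YRec, ZRec and the intermediate XYRec type are hypothetical names invented for this example, as are their fields; the point is only that, once the tuple is replaced by a case class, the second join can refer to a plain id column.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types, for illustration only.
case class XRec(id: Long, name: String)
case class YRec(fid: Long, value: String)
case class ZRec(fid: Long, flag: Boolean)
// Hypothetical intermediate type that replaces the (x, y) tuple.
case class XYRec(id: Long, name: String, value: Option[String])

object JoinViaCaseClass {
  def joinThree(spark: SparkSession,
                x: Dataset[XRec],
                y: Dataset[YRec],
                z: Dataset[ZRec]): Dataset[(XYRec, ZRec)] = {
    import spark.implicits._

    // Flatten the tuple right after the first join; the unmatched right
    // side of the left outer join is assumed to be null, hence Option.
    val xy: Dataset[XYRec] = x
      .joinWith(y, x("id") === y("fid"), "left_outer")
      .map { case (xr, yr) => XYRec(xr.id, xr.name, Option(yr).map(_.value)) }

    // The columns are now id, name and value, so no _1 prefix is needed.
    xy.joinWith(z, xy("id") === z("fid"))
  }
}
```

The result is still a Tuple2, here (XYRec, ZRec), so a final map would match on a flat pair rather than on a nested tuple.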
