How to join more than two Datasets using joinWith?

Question

I want to achieve something like this:

x.joinWith(y, x("id") === y("fid"), "left_outer")
  .joinWith(z, x("id") === z("fid"))
  .map { case (x, y, z) => combineXYZ(x, y, z) }

Answer 1

When you use joinWith, what you get is a new Dataset of Tuple2: (x, y). So the column names are _1 and _2.

So when you do your second join, you need to reference a column name from the tuple, not from one of the source Datasets. Like this:

x.joinWith(y, x("id") === y("fid"), "left_outer").joinWith(z, $"_1.id" === z("fid"))

Now, what you get is a Tuple2 whose first element is also a tuple: ((x, y), z). So you must write your map like this:

.map { case ((x, y), z) => combineXYZ(x, y, z) }
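
Putting the pieces above together, here is a minimal end-to-end sketch. The case classes X, Y and Z, their extra fields, the sample rows and the body of combineXYZ are all assumptions made up for illustration; only the id/fid column names and the joinWith calls come from the question. Since the first join is a left outer join, the y side can be null for unmatched rows, so this sketch wraps it in Option before combining.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical record types; only the id/fid column names come from the question.
case class X(id: Long, name: String)
case class Y(fid: Long, note: String)
case class Z(fid: Long, score: Int)

object JoinWithThree {
  // Hypothetical combiner. y is an Option because the left outer join
  // can leave the right side null when x has no match in y.
  def combineXYZ(x: X, y: Option[Y], z: Z): String =
    s"${x.name}|${y.map(_.note).getOrElse("-")}|${z.score}"

  def joinThree(spark: SparkSession): Array[String] = {
    import spark.implicits._

    val x = Seq(X(1L, "a"), X(2L, "b")).toDS()
    val y = Seq(Y(1L, "y1")).toDS()
    val z = Seq(Z(1L, 10), Z(2L, 20)).toDS()

    // First join: Dataset[(X, Y)] with columns _1 and _2.
    val xy: Dataset[(X, Y)] = x.joinWith(y, x("id") === y("fid"), "left_outer")

    // Second join: refer to the nested column _1.id of the tuple.
    val xyz: Dataset[((X, Y), Z)] = xy.joinWith(z, $"_1.id" === z("fid"))

    // Destructure the nested tuple ((x, y), z).
    xyz.map { case ((a, b), c) => combineXYZ(a, Option(b), c) }.collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("joinWith-three")
      .getOrCreate()
    try joinThree(spark).foreach(println)
    finally spark.stop()
  }
}
```

The intermediate val xy is not required; it is only there to make the tuple types of each step visible.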

This should work. Note that if you don't want to use $"_1.id", which is totally understandable, you can do a map after your first join to create a new object, other than a Tuple2, so that the second join can use a proper column name.
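
The alternative of mapping after the first join could look like the sketch below. The intermediate case class XY, the record types and the sample rows are hypothetical names introduced here for illustration; the idea is only that the first join result is mapped into a type with a top-level id column, so the second join can use xy("id") instead of $"_1.id".

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical record types; only the id/fid column names come from the question.
case class X(id: Long, name: String)
case class Y(fid: Long, note: String)
case class Z(fid: Long, score: Int)

// Hypothetical intermediate type that replaces the (X, Y) tuple.
// y is an Option because the left outer join can leave it null.
case class XY(id: Long, x: X, y: Option[Y])

object JoinWithNamed {
  def joinThree(spark: SparkSession): Array[String] = {
    import spark.implicits._

    val x = Seq(X(1L, "a"), X(2L, "b")).toDS()
    val y = Seq(Y(1L, "y1")).toDS()
    val z = Seq(Z(1L, 10), Z(2L, 20)).toDS()

    // Map the tuple into XY so the join key becomes a top-level column.
    val xy = x.joinWith(y, x("id") === y("fid"), "left_outer")
      .map { case (a, b) => XY(a.id, a, Option(b)) }

    // The second join can now refer to a plain column name.
    val xyz = xy.joinWith(z, xy("id") === z("fid"))

    xyz.map { case (XY(_, a, b), c) =>
      s"${a.name}|${b.map(_.note).getOrElse("-")}|${c.score}"
    }.collect()
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("joinWith-named")
      .getOrCreate()
    try joinThree(spark).foreach(println)
    finally spark.stop()
  }
}
```

The trade-off is an extra map step and an extra type in exchange for readable column names in later joins.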

Answer 1: 2 votes, accepted, 2019-08-13 08:36:29