
Equivalent to left outer join in SPARK

Is there a left outer join equivalent in Spark Scala? I understand there is a join operation, which is equivalent to a database inner join.

Spark Scala does support left outer join. Have a look here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD

Usage is quite simple:

rdd1.leftOuterJoin(rdd2)

It is as simple as rdd1.leftOuterJoin(rdd2), but you have to make sure that every element of each RDD is in the form (key, value).
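For example, here is a minimal self-contained sketch; the keys and values are invented purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object LeftOuterJoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("leftOuterJoin").setMaster("local[*]"))

    // Both inputs must be pair RDDs, i.e. RDD[(K, V)].
    val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val rdd2 = sc.parallelize(Seq((1, "x"), (3, "y")))

    // The result type is RDD[(Int, (String, Option[String]))]:
    // keys missing from rdd2 get None on the right side.
    rdd1.leftOuterJoin(rdd2).collect().foreach(println)
    // prints (1,(a,Some(x))), (2,(b,None)), (3,(c,Some(y))) in some order

    sc.stop()
  }
}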

Yes, there is. Have a look at the DStream APIs; they provide left as well as right outer joins.

If you have a stream of some type, let's say Record, and you wish to join two streams of records, then you can do it like this:

var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)

As the API docs say, the left and right streams have to be hash partitioned. That is, you can take some attributes from a Record (or derive a key in any other way), calculate a hash/key value from them, and convert each stream to a pair DStream. The left and right streams will be of type DStream[(Long, Record)] before you call that join function, as in the sketch below. (This is just an example; the key type can be something other than Long as well.)
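A hedged sketch of that setup follows; the Record case class, its id field, and the queue-backed input streams are stand-ins invented here just to keep the example runnable locally:

import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// `Record` and its `id` field are invented for this sketch.
case class Record(id: Long, payload: String)

object DStreamLeftOuterJoin {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("dstream-join").setMaster("local[2]"), Seconds(5))

    // queueStream stands in for a real source (Kafka, sockets, ...).
    val leftRaw: DStream[Record] = ssc.queueStream(mutable.Queue(
      ssc.sparkContext.parallelize(Seq(Record(1L, "l1"), Record(2L, "l2")))))
    val rightRaw: DStream[Record] = ssc.queueStream(mutable.Queue(
      ssc.sparkContext.parallelize(Seq(Record(1L, "r1")))))

    // Key both streams on the chosen attribute: DStream[(Long, Record)].
    val left  = leftRaw.map(r => (r.id, r))
    val right = rightRaw.map(r => (r.id, r))

    // Left records with no match on the right get None.
    val res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)
    res.print()

    ssc.start()
    ssc.awaitTermination()
  }
}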

The Spark SQL / DataFrame API also supports LEFT/RIGHT/FULL outer joins directly:

https://spark.apache.org/docs/latest/sql-programming-guide.html
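A minimal sketch of the DataFrame form, assuming a modern Spark with SparkSession (on 1.x you would build DataFrames from a SQLContext instead); the column names are illustrative:

import org.apache.spark.sql.SparkSession

object DataFrameOuterJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-outer-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

    // The third argument selects the join type:
    // "left_outer", "right_outer", "full_outer", ...
    left.join(right, Seq("id"), "left_outer").show()
    // +---+---+----+
    // | id|  l|   r|
    // +---+---+----+
    // |  1|  a|   x|
    // |  2|  b|null|
    // +---+---+----+

    spark.stop()
  }
}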

Because of this bug: https://issues.apache.org/jira/browse/SPARK-11111, outer joins in Spark prior to 1.6 might be very slow (unless you have really small data sets to join). Before 1.6 it used a cartesian product followed by filtering; now it uses SortMergeJoin instead.
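Continuing the DataFrame sketch above, one way to confirm which strategy the planner actually picked is explain(), which prints the physical plan (you would look for SortMergeJoin rather than a cartesian product there):

// Reuses `left` and `right` from the DataFrame sketch above.
left.join(right, Seq("id"), "left_outer").explain()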
