
Equivalent to left outer join in SPARK

Is there a left outer join equivalent in Spark Scala? I understand there is a join operation, which is equivalent to a database inner join.

Spark Scala does support left outer join. Have a look here: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.api.java.JavaPairRDD

Usage is quite simple:

rdd1.leftOuterJoin(rdd2)

It is as simple as rdd1.leftOuterJoin(rdd2), but you have to make sure that every element of each RDD is in the form (key, value).
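For example, here is a minimal self-contained sketch; the keys and values are invented purely for illustration:

import org.apache.spark.{SparkConf, SparkContext}

object LeftOuterJoinExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("leftOuterJoin").setMaster("local[*]"))

    // Both inputs must be pair RDDs, i.e. RDD[(K, V)].
    val rdd1 = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
    val rdd2 = sc.parallelize(Seq((1, "x"), (3, "y")))

    // The result type is RDD[(Int, (String, Option[String]))]:
    // keys missing from rdd2 get None on the right side.
    rdd1.leftOuterJoin(rdd2).collect().foreach(println)
    // prints (1,(a,Some(x))), (2,(b,None)), (3,(c,Some(y))) in some order

    sc.stop()
  }
}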

Yes, there is. Have a look at the DStream APIs; they provide left as well as right outer joins.

If you have a stream of some type, let's say Record, and you wish to join two streams of records, then you can do it like this:

var res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)

As the API docs say, the left and right streams have to be hash partitioned. That is, you can take some attributes from a Record (or derive a key in any other way), calculate a hash/key value from them, and convert each stream to a pair DStream. The left and right streams will be of type DStream[(Long, Record)] before you call that join function, as in the sketch below. (This is just an example; the key type can be something other than Long as well.)
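A hedged sketch of that setup follows; the Record case class, its id field, and the queue-backed input streams are stand-ins invented here just to keep the example runnable locally:

import scala.collection.mutable

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// `Record` and its `id` field are invented for this sketch.
case class Record(id: Long, payload: String)

object DStreamLeftOuterJoin {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("dstream-join").setMaster("local[2]"), Seconds(5))

    // queueStream stands in for a real source (Kafka, sockets, ...).
    val leftRaw: DStream[Record] = ssc.queueStream(mutable.Queue(
      ssc.sparkContext.parallelize(Seq(Record(1L, "l1"), Record(2L, "l2")))))
    val rightRaw: DStream[Record] = ssc.queueStream(mutable.Queue(
      ssc.sparkContext.parallelize(Seq(Record(1L, "r1")))))

    // Key both streams on the chosen attribute: DStream[(Long, Record)].
    val left  = leftRaw.map(r => (r.id, r))
    val right = rightRaw.map(r => (r.id, r))

    // Left records with no match on the right get None.
    val res: DStream[(Long, (Record, Option[Record]))] = left.leftOuterJoin(right)
    res.print()

    ssc.start()
    ssc.awaitTermination()
  }
}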

The Spark SQL / DataFrame API also supports LEFT/RIGHT/FULL outer joins directly:

https://spark.apache.org/docs/latest/sql-programming-guide.html
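A minimal sketch of the DataFrame form, assuming a modern Spark with SparkSession (on 1.x you would build DataFrames from a SQLContext instead); the column names are illustrative:

import org.apache.spark.sql.SparkSession

object DataFrameOuterJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("df-outer-join").master("local[*]").getOrCreate()
    import spark.implicits._

    val left  = Seq((1, "a"), (2, "b")).toDF("id", "l")
    val right = Seq((1, "x"), (3, "y")).toDF("id", "r")

    // The third argument selects the join type:
    // "left_outer", "right_outer", "full_outer", ...
    left.join(right, Seq("id"), "left_outer").show()
    // +---+---+----+
    // | id|  l|   r|
    // +---+---+----+
    // |  1|  a|   x|
    // |  2|  b|null|
    // +---+---+----+

    spark.stop()
  }
}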

Because of this bug: https://issues.apache.org/jira/browse/SPARK-11111, outer joins in Spark prior to 1.6 might be very slow (unless you have really small data sets to join). Before 1.6 it used a cartesian product followed by filtering; now it uses SortMergeJoin instead.
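Continuing the DataFrame sketch above, one way to confirm which strategy the planner actually picked is explain(), which prints the physical plan (you would look for SortMergeJoin rather than a cartesian product there):

// Reuses `left` and `right` from the DataFrame sketch above.
left.join(right, Seq("id"), "left_outer").explain()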
