
Join two RDD in spark

I have two RDDs: one RDD has just one column, the other has two columns. To join the two RDDs on keys I have added a dummy value of 0. Is there any more efficient way of doing this using join?

// u.data: user id \t movie id \t rating \t timestamp
val lines = sc.textFile("ml-100k/u.data")
// u.item: movie id | movie title | ...
val movienamesfile = sc.textFile("ml-100k/u.item")

val moviesid = lines.map(x => x.split("\t")).map(x => (x(1), 0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0), x(1)))
val joined = movienames.join(moviesid).distinct()

Edit:

Let me restate this question in SQL. Say for example I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like:

select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....

Here in SQL table1 has only one column whereas table2 has two columns, and the join still works. In the same way, Spark should be able to join on keys from both the RDDs.

Join operation is defined only on PairwiseRDDs, which are quite different from a relation / table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.
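To make the shape concrete, here is a minimal sketch (assuming an existing SparkContext sc; the data is made up for illustration):

val left = sc.parallelize(Seq((1, "a"), (2, "b")))
val right = sc.parallelize(Seq((1, "x"), (3, "y")))

// join matches elements on the first tuple element (the key) only
left.join(right).collect() // Array((1,(a,x)))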

If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause and the value as containing the selected columns.

SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
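On PairwiseRDDs the same query could be sketched like this (assuming table1 and table2 are already RDDs of (key, value) pairs):

// the key plays the role of the ON clause; the joined values are the selected columns
table1.join(table2).map { case (key, (value1, value2)) => (value1, value2) }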

While these approaches look similar at first glance and you can express one using the other, there is one fundamental difference. When you look at a SQL table and you ignore constraints, all columns belong to the same class of objects, while the key and value in a PairwiseRDD have a clear meaning.

Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around it.
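For example, a sketch of the question's moviesid with null instead of 0 (lines as defined in the question):

// null as the placeholder; join still requires *some* value on each element
val moviesid = lines.map(_.split("\t")).map(x => (x(1), null))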

For small data you can use filter in a similar way to a broadcast join:

val moviesidBD = sc.broadcast(
  // movie id is the second tab-separated field in u.data
  lines.map(x => x.split("\t")).map(x => x(1)).collect.toSet)

movienames.filter { case (id, _) => moviesidBD.value contains id }

but if you really want SQL-ish joins then you should simply use SparkSQL:

// toDF requires the DataFrame implicits in scope, e.g. import sqlContext.implicits._
val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a(1)))
  .toDF("id")

val movienamesDf = movienames.toDF("id", "name")

// Add optional join type qualifier 
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
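To also mirror the count(1) / group by from the SQL above, a sketch on the joined DataFrame (column names as assumed here):

movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))
  .groupBy(movienamesDf("id"), movienamesDf("name"))
  .count()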

On RDDs the join operation is only defined for PairwiseRDDs, so the values need to be turned into a paired RDD first. Below is a sample:

  val rdd1=sc.textFile("/data-001/part/")
  val rdd_1=rdd1.map(x=>x.split('|')).map(x=>(x(0),x(1)))
  val rdd2=sc.textFile("/data-001/partsupp/")
  val rdd_2=rdd2.map(x=>x.split('|')).map(x=>(x(0),x(1)))

  rdd_1.join(rdd_2).take(2).foreach(println)
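Each printed element has the shape (key, (valueFromRdd_1, valueFromRdd_2)), i.e. the result of join is again a PairwiseRDD keyed by the join key.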
