Join two RDDs in Spark
I have two RDDs: one has just one column and the other has two columns. To join the two RDDs on their keys I added a dummy value, 0, to the single-column RDD. Is there a more efficient way of doing this using join?
val lines = sc.textFile("ml-100k/u.data")
val movienamesfile = sc.textFile("Cml-100k/u.item")
val moviesid = lines.map(x => x.split("\t")).map(x => (x(1),0))
val test = moviesid.map(x => x._1)
val movienames = movienamesfile.map(x => x.split("\\|")).map(x => (x(0),x(1)))
val shit = movienames.join(moviesid).distinct()
Edit:
Let me restate this question in SQL. Say, for example, I have table1 (movieid) and table2 (movieid, moviename). In SQL we write something like:
select moviename, movieid, count(1)
from table2 inner join table1 on table1.movieid = table2.movieid
group by ....
Here in SQL, table1 has only one column whereas table2 has two columns, yet the join still works. In the same way, can Spark join on the keys from both RDDs?
The join operation is defined only on PairwiseRDDs, which are quite different from a relation / table in SQL. Each element of a PairwiseRDD is a Tuple2 where the first element is the key and the second is the value. Both can contain complex objects as long as the key provides a meaningful hashCode.
If you want to think about this in a SQL-ish way, you can consider the key as everything that goes into the ON clause, while the value contains the selected columns.
SELECT table1.value, table2.value
FROM table1 JOIN table2 ON table1.key = table2.key
While these approaches look similar at first glance, and you can express one using the other, there is one fundamental difference. When you look at a SQL table and ignore constraints, all columns belong to the same class of objects, while the key and the value in a PairwiseRDD each have a clear meaning.
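The key/value analogy can be sketched roughly as follows. This is only an illustration: the in-memory tuples and the names ratings, movies and ratingsByMovie are hypothetical stand-ins, not the asker's files, and SparkContext.getOrCreate is used so the snippet reuses the shell's context when one already exists.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in spark-shell, otherwise starts a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("key-value-sketch"))

// Hypothetical sample rows: (userId, movieId, rating) and (movieId, name)
val ratings = sc.parallelize(Seq(("196", "242", 3), ("186", "302", 3), ("22", "242", 1)))
val movies = sc.parallelize(Seq(("242", "Movie A"), ("302", "Movie B")))

// Key = what would appear in the ON clause, value = the selected columns
val ratingsByMovie = ratings.map { case (_, movieId, rating) => (movieId, rating) }

// Inner join on the key: RDD[(String, (Int, String))]
val joined = ratingsByMovie.join(movies)

joined.collect().foreach(println)
```

Everything that is not part of the key has to be carried in the value, which is why a single-column RDD needs some value, even a placeholder, before it can take part in a join.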
Going back to your problem: to use join you need both a key and a value. Arguably much cleaner than using 0 as a placeholder would be to use the null singleton, but there is really no way around it.
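A minimal sketch of the placeholder approach, assuming small hypothetical in-memory data instead of the ml-100k files (the names movieIds, movieNames and keyed are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Reuses the existing context in spark-shell, otherwise starts a local one
val sc = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("placeholder-join-sketch"))

// Hypothetical stand-ins for the single-column ids and the (id, name) pairs
val movieIds = sc.parallelize(Seq("1", "2", "2", "5"))
val movieNames = sc.parallelize(Seq(("1", "Movie A"), ("2", "Movie B"), ("3", "Movie C")))

// Deduplicate first so the join does not repeat rows, then attach a null placeholder value
val keyed = movieIds.distinct().map(id => (id, null))

// join yields (id, (name, null)); drop the placeholder again afterwards
val joined = movieNames.join(keyed).mapValues { case (name, _) => name }

joined.collect().foreach(println)
```

Calling distinct() on the ids before the join, rather than on the joined result as in the question, keeps the shuffled data smaller, although how much that matters depends on the data.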
For small data you can use filter in a way similar to a broadcast join:
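// Collect the (small) set of movie ids on the driver and broadcast it to the executors.
// The movie id is column 1 of u.data, as in the question.
val moviesidBD = sc.broadcast(
  lines.map(x => x.split("\t")).map(_(1)).collect.toSet)
// Keep only the movies whose id appears in the broadcast set
movienames.filter { case (id, _) => moviesidBD.value contains id }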
But if you really want SQL-ish joins then you should simply use SparkSQL:
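// toDF needs the SQL implicits in scope: import sqlContext.implicits._ (spark.implicits._ in Spark 2.x+)
val movieIdsDf = lines
  .map(x => x.split("\t"))
  .map(a => Tuple1(a(1)))  // movie id is column 1 of u.data, as in the question
  .toDF("id")
val movienamesDf = movienames.toDF("id", "name")
// Add optional join type qualifier
movienamesDf.join(movieIdsDf, movieIdsDf("id") <=> movienamesDf("id"))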
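As one possible use of the optional join type, a left semi join keeps the rows of the names table that have a match in the ids table without adding columns or repeating rows. The sketch below assumes Spark 2.x or later (SparkSession) and uses hypothetical in-memory data; the DataFrame names are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the existing session in spark-shell, otherwise starts a local one
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("semi-join-sketch")
  .getOrCreate()
import spark.implicits._

// Hypothetical stand-ins for the ids and the (id, name) table
val movieIdsDf = Seq("1", "2", "2").toDF("id")
val movieNamesDf = Seq(("1", "Movie A"), ("2", "Movie B"), ("3", "Movie C")).toDF("id", "name")

// "leftsemi" returns only left-side rows with a match; duplicate ids on the right do not duplicate output
val matched = movieNamesDf.join(movieIdsDf, Seq("id"), "leftsemi")

matched.show()
```

With a semi join there is no need for a placeholder column or a trailing distinct, which is close to what the SQL in the question expresses.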
On an RDD, the join operation is only defined for PairwiseRDDs, so you need to convert each RDD to a pair RDD first. Below is a sample:
val rdd1=sc.textFile("/data-001/part/")
val rdd_1=rdd1.map(x=>x.split('|')).map(x=>(x(0),x(1)))
val rdd2=sc.textFile("/data-001/partsupp/")
val rdd_2=rdd2.map(x=>x.split('|')).map(x=>(x(0),x(1)))
rdd_1.join(rdd_2).take(2).foreach(println)