Spark: join an RDD based on the sequence of another RDD
I have an RDD, say sample_rdd, of type RDD[(String, String, Int)] with 3 columns: id, item, count. Sample data:
id1|item1|1
id1|item2|3
id1|item3|4
id2|item1|3
id2|item4|2
I want to join each id against a lookup_rdd like this:
item1|0
item2|0
item3|0
item4|0
item5|0
For id1, the outer join with the lookup table should give me the following:
item1|1
item2|3
item3|4
item4|0
item5|0
Similarly, for id2 I should get:
item1|3
item2|0
item3|0
item4|2
item5|0
Finally, the output for each id should contain the id followed by all of its counts:
id1,1,3,4,0,0
id2,3,0,0,2,0
IMPORTANT: this output should always be ordered according to the order in the lookup.
This is what I have tried:
val line = sample_rdd
  .map { case (id, item, count) => (id, (item, count)) }
  .map(row => (row._1, row._2))
  .groupByKey()

get(line)
  .map(l => (l._1, l._2))
  .mapValues(item_count => lookup_rdd.leftOuterJoin(item_count))

def get(line: RDD[(String, Iterable[(String, Int)])]) = {
  for {
    (id, item_cnt) <- line
    i = item_cnt.map(tuple => (tuple._1, tuple._2))
  } yield (id, i)
}
Try the code below. Run each step in your local console to understand what is happening in detail.
The idea is to use zipWithIndex to build an indexed sequence from lookup_rdd and from the distinct ids: (i1,0),(i2,1)..(i5,4) and (id1,0),(id2,1)
Index wanted in the final result = [delta (the length of the lookup_rdd seq) * index of the id (id1..id2)] + index of the item (i1..i5)
So the base seq generated will be (0,(i1,id1)),(1,(i2,id1))...(8,(i4,id2)),(9,(i5,id2)), and then the counts are reduced and calculated based on the key (i1,id1).
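The index formula above can be checked without Spark. The following is a minimal sketch using plain Scala collections; the object name IndexSketch and the hard-coded items and ids are assumptions taken from the question's sample data, not part of the original answer.

```scala
object IndexSketch {
  // Lookup items and distinct ids, in the order they should appear
  // (assumed from the sample data in the question).
  val items = Seq("item1", "item2", "item3", "item4", "item5")
  val ids = Seq("id1", "id2")

  // delta is the number of lookup items.
  val delta: Long = items.size.toLong

  // Pair every indexed id with every indexed item and compute
  // finalIndex = delta * idIndex + itemIndex.
  val baseSeq: Seq[(Long, (String, String))] =
    for {
      (id, idIdx) <- ids.zipWithIndex
      (item, itemIdx) <- items.zipWithIndex
    } yield (delta * idIdx + itemIdx, (item, id))

  def main(args: Array[String]): Unit =
    baseSeq.foreach(println)
}
```

With 5 lookup items, id1 occupies indices 0 to 4 and id2 occupies indices 5 to 9, so sorting by this index keeps the lookup order within each id.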
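// arr and cart were not shown in the original answer; assumed here to hold the question's sample data
val arr = Array(("id1","item1",1), ("id1","item2",3), ("id1","item3",4), ("id2","item1",3), ("id2","item4",2))
val cart = Array(("item1",0), ("item2",0), ("item3",0), ("item4",0), ("item5",0))
val res2 = sc.parallelize(arr) //sample_rdd
val res3 = sc.parallelize(cart) //lookup_rdd
val delta = res3.count
// ((item, id), (final index, 0)) for every item/id combination
val res83 = res3.map(_._1).zipWithIndex.cartesian(res2.map(_._1).distinct.zipWithIndex).map(x => ((x._1._1,x._2._1),((delta * x._2._2) + x._1._2, 0)))
// ((item, id), total count)
val res86 = res2.map(x => ((x._2,x._1),x._3)).reduceByKey(_+_)
val res88 = res83.leftOuterJoin(res86)
// (final index, ((item, id), count)), keeping 0 where the id has no count for the item
val res91 = res88.map( x => {
  x._2._2 match {
    case Some(x1) => (x._2._1._1, (x._1,x._2._1._2+x1))
    case None => (x._2._1._1, (x._1,x._2._1._2))
  }
})
// sort by final index, then concatenate the counts per id
// note: ++ is not commutative, so reduceByKey does not strictly guarantee the list order across partitions; verify on your data
val res97 = res91.sortByKey(true).map( x => {
  (x._2._1._2,List(x._2._2))}).reduceByKey(_++_)
res97.collect
// SOLUTION: Array((id1,List(1,3,4,0,0)),(id2,List(3,0,0,2,0)))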
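
To sanity-check the expected result without a cluster, the same left-outer-join-and-fill logic can be written with plain Scala collections. This is only a local sketch, not the Spark solution: the object name LocalCheck and the inlined sample data are assumptions, and it orders the counts by walking the lookup directly instead of relying on a sorted index.

```scala
object LocalCheck {
  // Sample data and lookup from the question (assumed, inlined for a local run).
  val sample = Seq(("id1", "item1", 1), ("id1", "item2", 3), ("id1", "item3", 4),
                   ("id2", "item1", 3), ("id2", "item4", 2))
  val lookup = Seq(("item1", 0), ("item2", 0), ("item3", 0), ("item4", 0), ("item5", 0))

  // Sum the counts per (id, item), mirroring the reduceByKey step.
  val counts: Map[(String, String), Int] =
    sample
      .groupBy { case (id, item, _) => (id, item) }
      .map { case (key, rows) => key -> rows.map(_._3).sum }

  // For each id (in first-seen order), walk the lookup in its own order
  // and use 0 when the id has no count for that item.
  val result: Seq[(String, List[Int])] =
    sample.map(_._1).distinct.map { id =>
      id -> lookup.map { case (item, _) => counts.getOrElse((id, item), 0) }.toList
    }

  def main(args: Array[String]): Unit =
    result.foreach { case (id, cs) => println((id +: cs.map(_.toString)).mkString(",")) }
}
```

Running main prints one line per id in the requested format, for example id1,1,3,4,0,0.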