Self Join example in Apache Spark Java 8

I have a data set as below:

Kolkata,30
Delhi,23
Lucknow,33
Lucknow,36
Delhi,31
Kolkata,34
Delhi,21
Kolkata,23

I want to do a self join to get a result set of the form:

Lucknow -> (33, 36); Kolkata -> (30, 34), (34, 23), (23, 30).

How can this be achieved using Spark RDDs?

JavaPairRDD<String, Integer> words = file.mapToPair(s -> {
    String[] temp = s.split(",");
    return new Tuple2<>(temp[0], Integer.parseInt(temp[1]));
});

JavaPairRDD<String, Iterable<Integer>> temp1 = words.groupByKey();
JavaPairRDD<String, Iterable<Integer>> temp2 = words.groupByKey();
JavaPairRDD<String, Tuple2<Iterable<Integer>, Iterable<Integer>>> words3 = temp2.join(temp1);

How do I iterate over the tuple now?

To get an RDD of <String, Iterable<Tuple2<Integer, Integer>>> back, use .groupByKey() after the join.

JavaPairRDD<String, Iterable<Tuple2<Integer, Integer>>> result =
      words.join(words).groupByKey();
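
To answer the "how to iterate" part: a minimal sketch that collects the grouped result to the driver (assuming it is small enough to fit there) and walks each key's pairs:

result.collect().forEach(entry -> {
    String city = entry._1();
    // entry._2() is the Iterable of (value, value) pairs for this city
    for (Tuple2<Integer, Integer> pair : entry._2()) {
        System.out.println(city + " -> (" + pair._1() + ", " + pair._2() + ")");
    }
});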

It is possible I have the Java types wrong, but this is the right sequence of operations. I am more at ease with Python and Scala, where the types of result RDDs don't need to be specified; the Spark calls in those languages perform the same data operations.

Note: groupByKey() is known for being slow and resource-intensive on large datasets.
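
For illustration, when the downstream result is an aggregate rather than the full list of pairs, reduceByKey is the usual cheaper pattern, since it combines map-side and never materializes the groups. A hypothetical example (counting pairs per city, reusing the `words` RDD from above):

// count the joined pairs per key instead of collecting them
JavaPairRDD<String, Integer> pairCounts =
    words.join(words)
         .mapValues(v -> 1)          // each joined pair contributes 1
         .reduceByKey((a, b) -> a + b);  // summed map-side before the shuffle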

After words.join(words) you should have an RDD with elements like:

 (Kolkata, (30,34))
 (Kolkata, (30,23))
 ...

And .groupByKey() groups all of the values by key so that there is only one row for each key, like this:

 (Kolkata, [(30,34), (30,23), (34,23), ...])

For every unique key, an RDD self join yields all permutation pairs of its values: a key with k values produces k² pairs, including each (v, v) pair and both orderings of every combination.

To remove the duplicate pairs (assuming your data has no repeated values for a given key), filterduplicates() keeps only one ordering of each pair, which also drops the (v, v) pairs:

// outside the main function
type Pair = (String, (Int, Int))

def filterduplicates(p: Pair): Boolean = {
  p._2._1 < p._2._2
}

// inside main
val rdd = sc.textFile("../cities.txt")
val mapped = rdd.map(l => l.split(",")).map(l => (l(0), l(1).toInt))

val joined = mapped.join(mapped)
// (Lucknow,(33,33)) (Lucknow,(33,36)) (Lucknow,(36,33)) (Lucknow,(36,36))

val grouped = joined.filter(filterduplicates)
// (Lucknow,(33,36))

val listt = grouped.groupByKey().mapValues(_.toList)
val finalresult = listt.collect()
finalresult.foreach(println)
// (Delhi,List((23,31), (21,23), (21,31))) (Kolkata,List((30,34), (23,30), (23,34))) (Lucknow,List((33,36)))
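
Since the question targets Java 8, here is a rough, untested sketch of the same pipeline in Java (assuming `sc` is a JavaSparkContext and the same imports as the question, i.e. scala.Tuple2 and org.apache.spark.api.java.JavaPairRDD):

JavaPairRDD<String, Integer> cities = sc.textFile("../cities.txt")
    .mapToPair(l -> {
        String[] t = l.split(",");
        return new Tuple2<>(t[0], Integer.parseInt(t[1]));
    });

// self join, keep one ordering of each pair (same test as filterduplicates),
// then gather the surviving pairs per city
JavaPairRDD<String, Iterable<Tuple2<Integer, Integer>>> deduped =
    cities.join(cities)
          .filter(p -> p._2()._1() < p._2()._2())
          .groupByKey();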
