Intersection of Two HashMap (HashMap<Integer,HashSet<Integer>>) RDDs in Scala for Spark
I am working in Scala, programming Spark on a standalone machine (a PC running Windows 10). I am a newbie and don't have experience programming in Scala and Spark, so I will be very thankful for the help.
Problem:
I have a HashMap, hMap1, whose values are HashSets of Integer entries (HashMap<Integer, HashSet<Integer>>). I then store its values (i.e., many HashSet values) in an RDD. The code is as below:
val rdd1 = sc.parallelize(Seq(hMap1.values()))
Now I have another HashMap, hMap2, of the same type, i.e., HashMap<Integer, HashSet<Integer>>. Its values are also stored in an RDD as:
val rdd2 = sc.parallelize(Seq(hMap2.values()))
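Note that hMap1.values() returns a single java.util.Collection, so sc.parallelize(Seq(hMap1.values())) builds an RDD containing just one element (the whole collection) rather than one element per HashSet. A minimal sketch of converting the Java values into a Scala Seq[Set[Int]] that parallelize can distribute (the map contents below are illustrative, and on Scala 2.12 and earlier the import is scala.collection.JavaConverters._ instead):

```scala
import scala.jdk.CollectionConverters._
import java.util.{HashMap => JHashMap, HashSet => JHashSet}

// Hypothetical map with the shape from the question: HashMap<Integer, HashSet<Integer>>
val hMap1 = new JHashMap[Integer, JHashSet[Integer]]()
val s = new JHashSet[Integer]()
s.add(2); s.add(3)
hMap1.put(0, s)

// Convert each java.util.HashSet into an immutable Scala Set[Int];
// the resulting Seq can then be handed to sc.parallelize(values)
val values: Seq[Set[Int]] =
  hMap1.values().asScala.toSeq.map(_.asScala.map(_.intValue).toSet)
```

With this conversion, sc.parallelize(values) yields an RDD[Set[Int]] with one record per HashSet, which is the shape the answer below assumes.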
I want to know how I can intersect the values of hMap1 and hMap2.
For example:
Input:
data in rdd1 = [2, 3], [1, 109], [88, 17]
and data in rdd2 = [2, 3], [1, 109], [5, 45]
Output:
the output = [2, 3], [1, 109]
Problem statement
My understanding of your question is the following: given two RDDs of type RDD[Set[Integer]], how can I produce an RDD of their common records?
Sample data
Two RDDs generated by:
val rdd1 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(88, 17)))
val rdd2 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(5, 45)))
Possible solution 可能的解决方案
If my understanding of the problem statement is correct, you could use rdd1.intersection(rdd2)
if your RDDs are as I thought. 如果我对问题说明的理解是正确的,则可以使用
rdd1.intersection(rdd2)
如果您的RDD符合我的想法)。 This is what I tried on a spark-shell with Spark 2.2.0: 这是我在Spark 2.2.0的spark-shell上尝试过的方法:
rdd1.intersection(rdd2).collect
which yielded the output: 产生了输出:
Array(Set(2, 3), Set(1, 109))
This works because Spark can compare elements of type Set[Integer], but note that this is not generalisable to any Set[MyObject] unless you have defined the equality contract of MyObject.
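To illustrate the equality point with a Spark-free sketch (MyObject is hypothetical, and plain Scala collections stand in for the RDDs): declaring MyObject as a case class makes the compiler generate value-based equals and hashCode, which is exactly the contract that set membership, and hence intersection, relies on.

```scala
// A case class gets value-based equals/hashCode for free, so two
// MyObject instances with the same fields compare as equal.
case class MyObject(id: Int)

val data1 = Seq(Set(MyObject(2), MyObject(3)),
                Set(MyObject(1), MyObject(109)),
                Set(MyObject(88), MyObject(17)))
val data2 = Seq(Set(MyObject(2), MyObject(3)),
                Set(MyObject(1), MyObject(109)),
                Set(MyObject(5), MyObject(45)))

// Plain-Scala analogue of rdd1.intersection(rdd2): keep the common records
val common = data1.toSet.intersect(data2.toSet)
// common == Set(Set(MyObject(2), MyObject(3)), Set(MyObject(1), MyObject(109)))
```

Had MyObject been a regular class without overriding equals and hashCode, the two datasets' elements would compare by reference and the intersection would come back empty.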