
Intersection of Two HashMap (HashMap<Integer, HashSet<Integer>>) RDDs in Scala for Spark

I am working in Scala to program Spark on a standalone machine (a PC running Windows 10). I am a newbie and have no experience programming in Scala and Spark, so I would be very thankful for help.

Problem:

I have a HashMap, hMap1, whose values are HashSets of Integer entries (HashMap<Integer, HashSet<Integer>>). I then store its values (i.e., the many HashSet values) in an RDD. The code is as below:

val rdd1 = sc.parallelize(Seq(hMap1.values()))

Now I have another HashMap, hMap2, of the same type, i.e., HashMap<Integer, HashSet<Integer>>. Its values are also stored in an RDD:

val rdd2 = sc.parallelize(Seq(hMap2.values()))
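A minimal sketch of one way to build an RDD whose elements are the individual sets (assuming hMap1 is a java.util.HashMap[Integer, java.util.HashSet[Integer]] and sc is a live SparkContext; note that sc.parallelize(Seq(hMap1.values())) would instead produce a single-element RDD holding the whole values collection):

import scala.collection.JavaConverters._

// Convert each java.util.HashSet value to a Scala Set so that
// every set becomes its own RDD element.
val sets1 = hMap1.values().asScala.toSeq.map(_.asScala.toSet)
val rdd1 = sc.parallelize(sets1)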

I want to know how I can intersect the values of hMap1 and hMap2.

For example:

Input:

the data in rdd1 = [2, 3], [1, 109], [88, 17]

and the data in rdd2 = [2, 3], [1, 109], [5, 45]

Output:

so the output = [2, 3], [1, 109]

Problem statement

My understanding of your question is the following:

Given two RDDs of type RDD[Set[Integer]], how can I produce an RDD of their common records?

Sample data

Two RDDs generated by:

val rdd1 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(88, 17)))
val rdd2 = sc.parallelize(Seq(Set(2, 3), Set(1, 109), Set(5, 45)))

Possible solution

If my understanding of the problem statement is correct, you could use rdd1.intersection(rdd2) (if your RDDs are as I thought). This is what I tried in a spark-shell with Spark 2.2.0:

rdd1.intersection(rdd2).collect

which yielded the output:

Array(Set(2, 3), Set(1, 109))

This works because Spark can compare elements of type Set[Integer], but note that this is not generalisable to any Set[MyObject] unless you have defined the equality contract for MyObject.
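For instance, a case class gets a structural equals/hashCode from the compiler, so sets of its instances can be intersected the same way (a sketch; MyObject here is an assumed example type, and sc is a live SparkContext):

// Case classes come with compiler-generated equals/hashCode (and are
// Serializable), so Spark can compare Set[MyObject] elements correctly.
case class MyObject(id: Int, label: String)

val a = sc.parallelize(Seq(Set(MyObject(1, "a")), Set(MyObject(2, "b"))))
val b = sc.parallelize(Seq(Set(MyObject(1, "a")), Set(MyObject(5, "c"))))

a.intersection(b).collect  // Array(Set(MyObject(1,a)))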
