How to join two HashMaps by a key in Spark RDD
I have two RDDs, each in the format
{string, HashMap[long, object]}
I want to perform a join operation on them, in Scala, so that the HashMaps for the same key get merged.
RDD1 -> {string1, HashMap[{long a, object}, {long b, object}]}
RDD2 -> {string1, HashMap[{long c, object}]}
After joining the two RDDs, the result should look like:
RDD -> {string1, HashMap[{long a, object}, {long b, object}, {long c, object}]}
Any help would be appreciated; I am also fairly new to Scala and Spark.
Update: a simpler way is just to take the union and then reduce by key:
(rdd1 union rdd2).reduceByKey(_ ++ _)
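As a plain-Scala sketch of the per-key merge that `reduceByKey(_ ++ _)` applies (no SparkContext needed here; the variable names are illustrative): `Map`'s `++` concatenates entries, and on a duplicate key the right-hand map's value wins.

```scala
object UnionMergeSketch {
  def main(args: Array[String]): Unit = {
    // The two HashMap values that share the key "string1"
    val m1 = Map(1L -> "one", 2L -> "two")
    val m2 = Map(3L -> "three")

    // This is what reduceByKey(_ ++ _) computes for that key
    val merged = m1 ++ m2
    println(merged) // Map(1 -> one, 2 -> two, 3 -> three)
  }
}
```

Note that if the two maps shared a key, `++` silently keeps the right-hand value; use a custom combiner if you need to merge colliding entries instead.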
Older solution, just for reference. This can also be done with cogroup, which collects values for keys present in one or both RDDs (whereas join will omit values whose key appears in only one of the original RDDs). See the ScalaDoc.
We then concatenate the lists of values using ++ to form a single list of values, and finally reduce the values (Maps) to a single Map. The last two steps can be combined into a single mapValues operation:
Using this data...
val rdd1 = sc.parallelize(List("a"->Map(1->"one", 2->"two")))
val rdd2 = sc.parallelize(List("a"->Map(3->"three")))
...in the spark shell:
val x = (rdd1 cogroup rdd2).mapValues{ case (a,b) => (a ++ b).reduce(_++_)}
x foreach println
> (a,Map(1 -> one, 2 -> two, 3 -> three))
You can do this by joining the two RDDs and applying a merge function to the resulting tuples of maps:
def join[W](other: RDD[(K, W)], numSplits: Int): RDD[(K, (V, W))]
Return an RDD containing all pairs of elements with matching keys in this and other. Each pair of elements will be returned as a (k, (v1, v2)) tuple, where (k, v1) is in this and (k, v2) is in other. Performs a hash join across the cluster.
def mapValues[U](f: (V) ⇒ U): RDD[(K, U)]
Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
Assume there is a merge function like the one discussed in Best way to merge two maps and sum the values of same key?
def merge[K](a: K, b: K): K = ???
which could be, for example:
def merge[K, V](a: Map[K, V], b: Map[K, V]): Map[K, V] = a ++ b
Given that, the RDDs can be joined first:
val joined = RDD1.join(RDD2)
and then mapped:
val mapped = joined.mapValues(v => merge(v._1, v._2))
The result is an RDD of (key, merged Map) pairs.
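Putting the join-then-mapValues approach together, here is a minimal plain-Scala sketch that simulates the same pipeline on ordinary collections (no SparkContext is assumed; `JoinMergeSketch` and the sample data are illustrative):

```scala
object JoinMergeSketch {
  // The merge function from above: concatenate two maps
  def merge[K, V](a: Map[K, V], b: Map[K, V]): Map[K, V] = a ++ b

  def main(args: Array[String]): Unit = {
    // Simulate two keyed RDDs as Seq[(String, Map[Long, String])]
    val rdd1 = Seq("string1" -> Map(1L -> "a", 2L -> "b"))
    val rdd2 = Seq("string1" -> Map(3L -> "c"))

    // Inner join on the key, as RDD.join would do:
    // keep only keys present in both, pairing their values
    val lookup = rdd2.toMap
    val joined = rdd1.collect {
      case (k, v1) if lookup.contains(k) => (k, (v1, lookup(k)))
    }

    // mapValues-style step: merge each pair of maps
    val mapped = joined.map { case (k, (v1, v2)) => (k, merge(v1, v2)) }
    println(mapped) // List((string1,Map(1 -> a, 2 -> b, 3 -> c)))
  }
}
```

Because this is an inner join, a key that exists in only one input disappears from the result; that is exactly why the cogroup (or union + reduceByKey) variants above are preferable when both sides may have unmatched keys.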