Spark 1.5.1, Scala 2.10.5: how to expand an RDD[Array[String], Vector]
I am using Spark 1.5.1 with Scala 2.10.5.
I have an RDD[(Array[String], Vector)]. For each element of the RDD, I would like to take each String in the Array[String] and combine it with the Vector to create a tuple (String, Vector); this step will produce several tuples from each element of the initial RDD.

The goal is to end up with an RDD of tuples, RDD[(String, Vector)], containing all the tuples created in the previous step.
Thanks
Consider this:
rdd.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }
(The flatMap lets you get an RDD[(String, Vector)] as output, as opposed to a map, which would get you an RDD[Array[(String, Vector)]].)
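To see the difference in output types concretely, here is a minimal sketch runnable in spark-shell (assuming the question's Vector is org.apache.spark.mllib.linalg.Vector; the sample data and values are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors

// Made-up sample data: two (Array[String], Vector) elements
val pairs = sc.parallelize(Seq(
  (Array("foo", "bar"), Vectors.dense(1.0)),
  (Array("one", "two"), Vectors.dense(2.0))))

// map keeps the nesting: RDD[Array[(String, Vector)]]
val nested = pairs.map { case (arr, vec) => arr.map(s => (s, vec)) }

// flatMap flattens in one step: RDD[(String, Vector)]
val flat = pairs.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }

flat.collect
// Array((foo,[1.0]), (bar,[1.0]), (one,[2.0]), (two,[2.0]))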
Have you tried this?
// rdd: RDD[(Array[String], Vector)] - the initial RDD
val new_rdd = rdd
  .flatMap {
    case (array: Array[String], vec: Vector) => array.map(str => (str, vec))
  }
Toy example (I'm running it in spark-shell):
val rdd = sc.parallelize(Array((Array("foo", "bar"), 100), (Array("one", "two"), 200)))
val new_rdd = rdd
  .map {
    case (array: Array[String], vec: Int) => array.map(str => (str, vec))
  }
  .flatMap(arr => arr)

new_rdd.collect
res14: Array[(String, Int)] = Array((foo,100), (bar,100), (one,200), (two,200))
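For what it's worth, the two-step map followed by flatMap(arr => arr) above collapses into the single flatMap shown in the first snippet; a quick check against the same toy rdd (same_rdd is just an illustrative name):

// Equivalent single-step version: flatten while mapping
val same_rdd = rdd.flatMap { case (array, vec) => array.map(str => (str, vec)) }
same_rdd.collect
// Array[(String, Int)] = Array((foo,100), (bar,100), (one,200), (two,200))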