
Spark 1.5.1, Scala 2.10.5: how to expand an RDD[(Array[String], Vector)]

I am using Spark 1.5.1 with Scala 2.10.5

I have an RDD[(Array[String], Vector)]. For each element of the RDD:

  • I want to take each String in the Array[String] and pair it with the Vector to create a tuple (String, Vector); this step produces several tuples from each element of the initial RDD.

The goal is to end up with an RDD of tuples, RDD[(String, Vector)], that contains all the tuples created in the previous step.

Thanks

Consider this:

rdd.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }

(The flatMap gives you an RDD[(String, Vector)] as output, as opposed to a map, which would give you an RDD[Array[(String, Vector)]].)
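
To make that difference concrete, here is a minimal sketch of the two variants side by side, assuming rdd: RDD[(Array[String], Vector)] as in the question:

// map keeps one output element per input element, so the arrays survive:
val nested = rdd.map { case (arr, vec) => arr.map(s => (s, vec)) }
// nested: RDD[Array[(String, Vector)]]

// flatMap flattens the collection returned for each element:
val flat = rdd.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }
// flat: RDD[(String, Vector)]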

Have you tried this?

// rdd: RDD[(Array[String], Vector)] - initial RDD
val new_rdd = rdd
  .flatMap {
    case (array: Array[String], vec: Vector) => array.map(str => (str, vec))
  }
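
One detail worth flagging: unless you import org.apache.spark.mllib.linalg.Vector, the Vector in that pattern match resolves to scala.collection.immutable.Vector from Scala's default imports. A minimal sketch of the import, assuming the values are MLlib vectors:

// without this import, `Vector` means scala.collection.immutable.Vector
import org.apache.spark.mllib.linalg.Vector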

Toy example (I'm running it in spark-shell):

val rdd = sc.parallelize(Array((Array("foo", "bar"), 100), (Array("one", "two"), 200)))
val new_rdd = rdd
  .map {
    case (array: Array[String], vec: Int) => array.map(str => (str, vec))
  }
  .flatMap(arr => arr) // flatten the RDD[Array[(String, Int)]] into an RDD[(String, Int)]
new_rdd.collect
res14: Array[(String, Int)] = Array((foo,100), (bar,100), (one,200), (two,200))
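
For completeness, here is the same pattern run end to end with actual MLlib vectors instead of the Int placeholder; a sketch for spark-shell, with illustrative Vectors.dense values:

import org.apache.spark.mllib.linalg.{Vector, Vectors}

// illustrative data; in practice the vectors come from your pipeline
val rdd = sc.parallelize(Seq(
  (Array("foo", "bar"), Vectors.dense(1.0, 2.0)),
  (Array("one", "two"), Vectors.dense(3.0, 4.0))
))

// a single flatMap does the map-then-flatten two-step in one pass
val new_rdd = rdd.flatMap { case (array, vec) => array.map(str => (str, vec)) }
// new_rdd: RDD[(String, Vector)]

new_rdd.collect
// Array((foo,[1.0,2.0]), (bar,[1.0,2.0]), (one,[3.0,4.0]), (two,[3.0,4.0]))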
