I am using Spark 1.5.1 with Scala 2.10.5.

I have an RDD[(Array[String], Vector)]. For each element of the RDD, I want to take each String in the Array[String] and combine it with the Vector to create a tuple (String, Vector). This step will produce several tuples from each element of the initial RDD. The goal is to end up with an RDD containing all the tuples created in the previous step: RDD[(String, Vector)].

Thanks
Consider this:

rdd.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }

(flatMap lets you get an RDD[(String, Vector)] as output, as opposed to map, which would give you an RDD[Array[(String, Vector)]].)
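The same map vs. flatMap distinction can be seen with plain Scala collections, which share these semantics with RDDs. A minimal sketch (no Spark required; an Int stands in for the Vector, and the data is made up for illustration):

```scala
object FlatMapDemo {
  def main(args: Array[String]): Unit = {
    // Analogue of RDD[(Array[String], Vector)], with Int standing in for Vector
    val data = List((Array("foo", "bar"), 100), (Array("one", "two"), 200))

    // map keeps one output element per input element:
    // the result type is List[Array[(String, Int)]]
    val mapped: List[Array[(String, Int)]] =
      data.map { case (arr, vec) => arr.map(s => (s, vec)) }

    // flatMap flattens the inner collections in the same pass:
    // the result type is List[(String, Int)]
    val flatMapped: List[(String, Int)] =
      data.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }

    println(flatMapped) // List((foo,100), (bar,100), (one,200), (two,200))
  }
}
```

Replacing List with an RDD (and Int with Vector) gives exactly the one-liner above.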
Have you tried this?
// rdd: RDD[(Array[String], Vector)] - initial RDD
val new_rdd = rdd
.flatMap {
case (array: Array[String], vec: Vector) => array.map(str => (str, vec))
}
Toy example (I'm running it in spark-shell):
val rdd = sc.parallelize(Array((Array("foo", "bar"), 100), (Array("one", "two"), 200)))
val new_rdd = rdd
.map {
case (array: Array[String], vec: Int) => array.map(str => (str, vec))
}
.flatMap(arr => arr)
new_rdd.collect
res14: Array[(String, Int)] = Array((foo,100), (bar,100), (one,200), (two,200))