Spark 1.5.1, Scala 2.10.5: how to expand an RDD[Array[String], Vector]
I am using Spark 1.5.1 with Scala 2.10.5.
I have an RDD[(Array[String], Vector)]. For each element of the RDD, I would like to take each String in the Array[String] and combine it with the Vector to create a tuple (String, Vector); this step will produce several tuples from each element of the initial RDD.

The goal is to end up with an RDD of tuples, RDD[(String, Vector)], containing all the tuples created in the previous step.
Thanks
Consider this:
rdd.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }
(The flatMap lets you get an RDD[(String, Vector)] as output, as opposed to a map, which would get you an RDD[Array[(String, Vector)]].)
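To see the difference in output types concretely, here is a minimal sketch runnable in spark-shell (assuming the question's Vector is org.apache.spark.mllib.linalg.Vector; the sample data and values are made up for illustration):

import org.apache.spark.mllib.linalg.Vectors

// Made-up sample data: two (Array[String], Vector) elements
val pairs = sc.parallelize(Seq(
  (Array("foo", "bar"), Vectors.dense(1.0)),
  (Array("one", "two"), Vectors.dense(2.0))))

// map keeps the nesting: RDD[Array[(String, Vector)]]
val nested = pairs.map { case (arr, vec) => arr.map(s => (s, vec)) }

// flatMap flattens in one step: RDD[(String, Vector)]
val flat = pairs.flatMap { case (arr, vec) => arr.map(s => (s, vec)) }

flat.collect
// Array((foo,[1.0]), (bar,[1.0]), (one,[2.0]), (two,[2.0]))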
Have you tried this?
// rdd: RDD[(Array[String], Vector)] - the initial RDD
val new_rdd = rdd
  .flatMap {
    case (array: Array[String], vec: Vector) => array.map(str => (str, vec))
  }
Toy example (I'm running it in spark-shell):
val rdd = sc.parallelize(Array((Array("foo", "bar"), 100), (Array("one", "two"), 200)))
val new_rdd = rdd
  .map {
    case (array: Array[String], vec: Int) => array.map(str => (str, vec))
  }
  .flatMap(arr => arr)

new_rdd.collect
res14: Array[(String, Int)] = Array((foo,100), (bar,100), (one,200), (two,200))
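For what it's worth, the two-step map followed by flatMap(arr => arr) above collapses into the single flatMap shown in the first snippet; a quick check against the same toy rdd (same_rdd is just an illustrative name):

// Equivalent single-step version: flatten while mapping
val same_rdd = rdd.flatMap { case (array, vec) => array.map(str => (str, vec)) }
same_rdd.collect
// Array[(String, Int)] = Array((foo,100), (bar,100), (one,200), (two,200))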