Combining two arrays in an RDD by index

Question

I have an RDD that contains two arrays in each row: RDD[(Array[Int], Array[Double])]. Within a row, the two arrays have the same size n. However, n differs from row to row and can be up to 200. The sample data is as follows:

(Array(1, 3, 5), Array(1.0, 1.0, 2.0))
(Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0))
(Array(2, 4), Array(1.0, 3.0))
. . .

I want to combine the two arrays element by element, according to the index, for each row. The expected output is as follows:

((1,1.0), (3,1.0), (5,2.0))
((6,2.0), (3,1.0), (1,2.0), (9,1.0))
((2,1.0), (4,3.0))

This is my code:

val data = spark.sparkContext.parallelize(Seq(
  (Array(1, 3, 5), Array(1.0, 1.0, 2.0)),
  (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)),
  (Array(2, 4), Array(1.0, 3.0))
))
val pairArr = data.map{x =>
  (x._1(0), x._2(0))
}
// pairArr.collect: Array((1,1.0), (6,2.0), (2,1.0))

This code only takes the first element of each array in every row.
Can anybody point me toward how to get the expected output?

Thanks.

Answer 1

You need to zip the two elements of each tuple:

data.map(x => x._1.zip(x._2)).collect
// res1: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))

Or with pattern matching:

data.map{ case (x, y) => x.zip(y) }.collect
// res0: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))
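One thing to keep in mind is that zip on Scala arrays stops at the shorter input, so a row whose two arrays differ in length would silently lose elements. The following is a minimal sketch in plain Scala collections (no Spark needed to run it) that adds a length check before zipping. The names ZipByIndex and zipRow are hypothetical and are not from the original post. The assumption is that a mismatch should fail loudly rather than truncate.

```scala
object ZipByIndex {
  // Hypothetical helper: pairs the elements of the two arrays by position.
  // Array.zip truncates to the shorter array, so the lengths are checked first.
  def zipRow(row: (Array[Int], Array[Double])): Array[(Int, Double)] = {
    val (keys, values) = row
    require(keys.length == values.length,
      s"length mismatch: ${keys.length} vs ${values.length}")
    keys.zip(values)
  }

  def main(args: Array[String]): Unit = {
    // Same sample rows as in the question, held in a local Seq instead of an RDD.
    val rows = Seq(
      (Array(1, 3, 5), Array(1.0, 1.0, 2.0)),
      (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)),
      (Array(2, 4), Array(1.0, 3.0))
    )
    // Print each zipped row on its own line.
    rows.map(zipRow).foreach(pairs => println(pairs.mkString("(", ", ", ")")))
  }
}
```

With the RDD from the question, the same function should be usable as data.map(ZipByIndex.zipRow), assuming the object is available on the executors' classpath.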
