Associating two arrays in an RDD by index

I have an RDD that contains two arrays in each row: RDD[(Array[Int], Array[Double])]. Within a row, the two arrays have the same size n. However, n differs from row to row and can be up to 200. The sample data is as follows:

(Array(1, 3, 5), Array(1.0, 1.0, 2.0))
(Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0))
(Array(2, 4), Array(1.0, 3.0))
. . .

I want to combine the two arrays element by element, according to the index, for each row. So the expected output is as follows:

((1,1.0), (3,1.0), (5,2.0))
((6,2.0), (3,1.0), (1,2.0), (9,1.0))
((2,1.0), (4,3.0))

This is my code:

val data = spark.sparkContext.parallelize(Seq(
  (Array(1, 3, 5), Array(1.0, 1.0, 2.0)),
  (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)),
  (Array(2, 4), Array(1.0, 3.0))
))
val pairArr = data.map{x =>
  (x._1(0), x._2(0))
}
//pairArr: Array((1,1.0), (6,2.0), (2,1.0))

This code only takes the first element of each array in every row.
Can anybody point me toward how to get the expected output?

Thanks.

You need to zip the two arrays in each tuple:

data.map(x => x._1.zip(x._2)).collect
// res1: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))

Or with pattern matching:

data.map{ case (x, y) => x.zip(y) }.collect
// res0: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))
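
One thing worth knowing about zip is that it stops at the shorter of the two arrays. The question says both arrays in a row have the same size, so that should not matter here, but a rough sketch of the behaviour may still help. The sketch below uses plain Scala collections in place of the RDD so that it runs without a Spark session (that substitution is an assumption made for illustration; the function passed to map is the same one used above). The names ZipSketch and zipStrict are hypothetical and not part of Spark or the standard library.

```scala
object ZipSketch {
  // Plain Seq stands in for the RDD (assumption, so no Spark session is needed).
  val rows: Seq[(Array[Int], Array[Double])] = Seq(
    (Array(1, 3, 5), Array(1.0, 1.0, 2.0)),
    (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)),
    (Array(2, 4), Array(1.0, 3.0))
  )

  // Pair elements by position within each row, as in the answer above.
  val paired: Seq[Array[(Int, Double)]] =
    rows.map { case (ids, values) => ids.zip(values) }

  // zip stops at the shorter array, so the trailing 3 is silently dropped here.
  val truncated: Array[(Int, Double)] = Array(1, 2, 3).zip(Array(1.0, 2.0))

  // Hypothetical guard: throw instead of truncating when the lengths differ.
  def zipStrict(ids: Array[Int], values: Array[Double]): Array[(Int, Double)] = {
    require(ids.length == values.length,
      s"length mismatch: ${ids.length} vs ${values.length}")
    ids.zip(values)
  }
}
```

If a length mismatch would indicate bad data, something like zipStrict could be used inside data.map in place of the bare zip, so the job fails loudly rather than dropping elements.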
