
Associating two arrays in an RDD by index

I have an RDD that contains two arrays in each row: RDD[(Array[Int], Array[Double])]. Within a row, the two arrays have the same size n. However, n differs from row to row and can be as large as 200. The sample data is as follows:

(Array(1, 3, 5), Array(1.0, 1.0, 2.0))
(Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0))
(Array(2, 4), Array(1.0, 3.0))
. . .

I want to combine the two arrays in each row by index, pairing the i-th element of the first array with the i-th element of the second. So, the expected output is as follows:

((1,1.0), (3,1.0), (5,2.0))
((6,2.0), (3,1.0), (1,2.0), (9,1.0))
((2,1.0), (4,3.0))

This is my code:

val data = spark.sparkContext.parallelize(Seq((Array(1, 3, 5), Array(1.0, 1.0, 2.0)), (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)), (Array(2, 4), Array(1.0, 3.0))))
val pairArr = data.map { x =>
  (x._1(0), x._2(0))
}.collect
// pairArr: Array[(Int, Double)] = Array((1,1.0), (6,2.0), (2,1.0))

This code only takes the element at the first index of each array in a row.
Can anybody point me in the right direction to get the expected output?


You need to zip the two arrays in each tuple, which pairs their elements by position:

data.map(x => x._1.zip(x._2)).collect
// res1: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))

Or with pattern matching:

data.map{ case (x, y) => x.zip(y) }.collect
// res0: Array[Array[(Int, Double)]] = Array(Array((1,1.0), (3,1.0), (5,2.0)), Array((6,2.0), (3,1.0), (1,2.0), (9,1.0)), Array((2,1.0), (4,3.0)))
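
To see what zip does on its own, here is a minimal sketch that uses plain Scala collections in place of the RDD, so it can be pasted into a Scala REPL or spark-shell without a cluster. The names rows, paired and truncated are made up for illustration. It also shows one thing worth knowing: zip silently stops at the shorter input, so if the two arrays in a row could ever differ in length, the extra elements are dropped rather than reported.

```scala
// Plain Scala collections stand in for the RDD rows (assumption: same shape as `data`).
val rows: Seq[(Array[Int], Array[Double])] = Seq(
  (Array(1, 3, 5), Array(1.0, 1.0, 2.0)),
  (Array(6, 3, 1, 9), Array(2.0, 1.0, 2.0, 1.0)),
  (Array(2, 4), Array(1.0, 3.0))
)

// Same function body as the RDD version: pair the elements of each row by index.
val paired: Seq[Array[(Int, Double)]] =
  rows.map { case (ids, values) => ids.zip(values) }

// Print one row per line.
paired.foreach(row => println(row.mkString("(", ", ", ")")))
// ((1,1.0), (3,1.0), (5,2.0))
// ((6,2.0), (3,1.0), (1,2.0), (9,1.0))
// ((2,1.0), (4,3.0))

// zip truncates to the shorter array instead of failing.
val truncated: Array[(Int, Double)] = Array(1, 2, 3).zip(Array(1.0, 2.0))
// truncated contains (1,1.0) and (2,2.0); the 3 is dropped.
```

Because the function passed to map is the same in both cases, the RDD version behaves the same way per row. If mismatched lengths would indicate bad data, a length check inside the map before zipping is one way to catch it.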
