The original data I have looks like this:
RDD data:
key -> index
1 -> 2
1 -> 3
1 -> 5
2 -> 1
2 -> 3
2 -> 4
How can I convert the RDD to the following format?
key -> index1, index2, index3, index4, index5
1 -> 0,1,1,0,1
2 -> 1,0,1,1,0
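For context, something like the following sketch would reproduce this data (assuming the pairs are grouped by key into an RDD[(Int, Iterable[Int])]):
val data = sc.parallelize(Seq((1, 2), (1, 3), (1, 5), (2, 1), (2, 3), (2, 4)))
val filtered_data_by_key = data.groupByKey()  // key -> all indices for that key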
My current method is:
val vectors = filtered_data_by_key.map( x => {
    var temp = Array[AnyVal]()
    x._2.copyToArray(temp)
    (x._1, Vectors.sparse(filtered_key_size, temp.map(_.asInstanceOf[Int]), Array.fill(filtered_key_size)(1)))
})
I got a strange error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 54.0 failed 1 times, most recent failure: Lost task 3.0 in stage 54.0 (TID 75, localhost): java.lang.IllegalArgumentException: requirement failed
When I tried to debug the program with the following code:
val vectors = filtered_data_by_key.map( x => {
    val temp = Array[AnyVal]()
    val t = x._2.copyToArray(temp)
    (x._1, temp)
})
I found that temp is empty, so the problem is in copyToArray(). I am not sure how to solve this.
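A minimal sketch of the behavior I am seeing (plain Scala, no Spark):
val src = Seq(2, 3, 5)
val dst = new Array[Int](0)  // zero-length destination, like my empty Array[AnyVal]()
src.copyToArray(dst)         // copyToArray copies at most dst.length elements, so dst stays empty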
I don't understand the question completely. Why are your keys important? And what is the maximum index value? In your code you are using the number of distinct keys as the maximum index value, but I believe that is a mistake.
I will assume the maximum index value is 5. In that case, I believe this is what you're looking for:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// indices must be 0-based and strictly increasing; values must be Double
val vectors = data_by_key.map { case (k, it) =>
  Vectors.sparse(5, it.map(_ - 1).toArray.sorted, Array.fill(it.size)(1.0)) }
val rm = new RowMatrix(vectors)
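If you do need to keep the keys in the output, a variant along these lines (my sketch, same assumptions as above) would pair each key with its vector instead of building a RowMatrix:
val keyedVectors = data_by_key.map { case (k, it) =>
  (k, Vectors.sparse(5, it.map(_ - 1).toArray.sorted, Array.fill(it.size)(1.0))) }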
I decreased the indices by one because sparse vector indices should start at 0. The 'requirement failed' error is raised because your indices and values arrays do not have the same length: temp ends up empty, while Array.fill(filtered_key_size)(1) has filtered_key_size elements.
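To illustrate (a sketch against the MLlib Vectors API), a call with mismatched array lengths fails the internal require check:
// indices has length 0 but values has length 5, so the require on
// equal lengths fails with a java.lang.IllegalArgumentException
Vectors.sparse(5, Array.empty[Int], Array.fill(5)(1.0))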