
Empty collection when doing map reduce in Scala

I'm encountering a Spark job that quits with an "empty collection" error:

java.lang.UnsupportedOperationException: empty collection

I have narrowed the issue down to these two lines:

sum_attribute1 = inputRDD.map(_.attribute1).reduce(_+_)
sum_attribute2 = inputRDD.map(_.attribute2).reduce(_+_)

Other lines that do .map and .distinct.count work fine. I would like to print out inputRDD.map(_.attribute1) and inputRDD.map(_.attribute2) to see what was mapped before the reduce.

I thought I could define something like

sum_attribute1 = inputRDD.map(_.attribute1)

but when I tried to compile the code, it showed this error:

[error]  found   : org.apache.spark.rdd.RDD[Int]
[error]  required: Long
[error] sum_attribute1 = inputRDD.map(_.attribute1)
[error]                              ^

My attribute1 was defined as Int, but when I tried to define it as Long, it gave me another error.
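For reference, a minimal sketch of what this could look like, with a hypothetical Record case class standing in for the question's actual data type. The compiler message suggests sum_attribute1 was already declared as a Long, which is why it cannot also hold an RDD[Int]; binding the mapped RDD to its own value, with the type it actually has, makes it possible to inspect it before reducing:

import org.apache.spark.rdd.RDD

// Hypothetical stand-in for the actual data type in the question.
case class Record(attribute1: Int, attribute2: Int)

def inspectAndSum(inputRDD: RDD[Record]): Long = {
  // Bind the mapped RDD to its own val, with the type it actually has.
  val attr1: RDD[Int] = inputRDD.map(_.attribute1)

  // Print a few elements on the driver to see what was mapped.
  // take(10) is safe even on an empty RDD (it returns an empty array);
  // collect() would pull everything to the driver, so only use it on small data.
  attr1.take(10).foreach(println)

  // Widen to Long before summing; note this still throws on an empty RDD
  // (see the answer below).
  attr1.map(_.toLong).reduce(_ + _)
}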

Am I going in the right direction? How can I print the data after the map and before the reduce? What could be causing the empty collection error? And what do the underscores in _.attribute1 and reduce(_+_) mean?

I don't think you are going in the right direction; I would focus on the points below:

I recommend that you learn a bit of Scala first. For one of your specific questions, read about that usage of _.
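To illustrate the placeholder syntax (a sketch, reusing the hypothetical Record type from the sketch above): each underscore stands for an anonymous function parameter, in order of appearance.

// _.attribute1 is shorthand for a one-argument anonymous function:
val getAttr1: Record => Int = _.attribute1   // same as: r => r.attribute1

// In _ + _, each underscore is a *different* parameter:
val add: (Int, Int) => Int = _ + _           // same as: (a, b) => a + b

// So inputRDD.map(_.attribute1).reduce(_ + _) maps every record to its
// attribute1 field and then sums the results pairwise.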

As for your other question: reduce cannot be used on an empty collection. I recommend using fold instead, as it handles empty collections just fine.
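A minimal sketch of the difference, assuming a SparkContext named sc:

val empty = sc.parallelize(Seq.empty[Int])

// reduce has no starting value, so it throws on an empty RDD:
// java.lang.UnsupportedOperationException: empty collection
// empty.reduce(_ + _)

// fold takes a zero element and simply returns it when there is nothing to combine:
val total = empty.fold(0)(_ + _)   // 0

// Alternatively, guard the reduce explicitly:
val total2 = if (empty.isEmpty()) 0 else empty.reduce(_ + _)

One caveat with RDD.fold: the zero value may be applied once per partition and again when merging partition results, so it must be the identity of the operation (0 for addition).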
