
Reduce by key into list of tuples in Spark

I am trying to transpose my data so that it is a list of tuples for each key instead of columns of data.

So as input I have:

1   234   54   7   9
2   654   34   2   1
1   987   22   4   6

and I want as output written to parquet files:

1:2   ((234, 54, 7, 9), (987, 22, 4, 6))
2:6   ((654, 34, 2, 1))

As input I have two sets of parquet data. I read them in and join them as DataFrames, then map each row to a key-value pair and use reduceByKey to combine all of each key's records into one big list of tuples.

val keyedRDD = joinedDF.map(row => (
  // key: first column plus the bucket the record number falls into
  "" + row.getInt(0) + ":" + (row.getInt(1) / PARTITION_SIZE),
  // value: a single-element list so reduceByKey can concatenate per key
  List(Record(
    row.getInt(1),
    row.getInt(2),
    row.getInt(3),
    row.getInt(4)
  ))
))

val reduced = keyedRDD.reduceByKey(_ ::: _)

PARTITION_SIZE here is just a variable I set for each run to split the data into chunks of that size. For example, if I pass in 100000 and there are 2 million records, then records numbered 0-99,999 will be in one bucket, 100,000-199,999 in the next, and so on.
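
As a minimal sketch of that arithmetic (reusing the 100000 from the example), the bucket is just integer division of the record number:

val PARTITION_SIZE = 100000
val bucketA = 99999  / PARTITION_SIZE  // 0: records 0-99,999
val bucketB = 100000 / PARTITION_SIZE  // 1: records 100,000-199,999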

Record is just a simple case class to hold this data. I've tried it with plain tuples and with just putting the values in a list by themselves, with the same results.
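
For reference, a plausible shape for that case class (the field names here are hypothetical, since the original definition isn't shown); it only needs to hold the four Int columns pulled from each row:

// assumed sketch of the Record case class; field names are placeholders
case class Record(col1: Int, col2: Int, col3: Int, col4: Int)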

It is my understanding that this should then reduce down to one list per key, as described above. However, I cannot get this job to finish. The Spark History Server always shows it hanging at the map stage (the stage doesn't even start), even though Ganglia shows at least 80% CPU usage and high memory usage. The console gets stuck on these messages:

16/01/18 01:26:10 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 2485 bytes
16/01/18 01:26:10 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to ip-172-31-7-127.ec2.internal:34337
16/01/18 01:26:10 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to ip-172-31-7-129.ec2.internal:45407
16/01/18 01:26:17 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to ip-172-31-7-128.ec2.internal:59468
16/01/18 01:26:17 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 75087 bytes
16/01/18 01:26:18 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to ip-172-31-7-127.ec2.internal:34337
16/01/18 01:26:18 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to ip-172-31-7-129.ec2.internal:45407

One dataset is about 3GB and the other about 22GB, so really not big at all. But I thought maybe I was running out of memory (even though I don't get an OOM or executor-lost message until 20+ hours of being stuck). I've tried EMR clusters of m3.xlarge with 2 slave nodes, m3.xlarge with 6 slave nodes, and even r3.xlarge with 6 slave nodes, and I still get the same problem. I have set up my EMR clusters to give Spark the maximum possible memory allocation, enabled Spark dynamic allocation, tweaked the memoryFraction settings, etc.

I just can't figure out why it hangs where it does. I tried simplifying it by making a (key, 1) RDD in the map and summing in the reduce, and that version finished in 20 minutes.
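
A sketch of that simplified version (assuming the same key construction as above):

val counts = joinedDF
  .map(row => ("" + row.getInt(0) + ":" + (row.getInt(1) / PARTITION_SIZE), 1))
  .reduceByKey(_ + _)  // just counts records per key instead of building lists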

Appending to a list is an expensive operation, and this is a common mistake. Remember Scala's bias towards immutable objects. A good place to start is to google "Scala list append performance"; that will turn up several good blog posts that describe the problem and the recommendations in detail. In general, appending to an immutable list is expensive because each operation builds a new list, which is very compute- and memory-intensive. You can prepend to the list instead, or, often the best answer, use a ListBuffer. Again, look at the blogs and note the performance characteristics.

http://www.scala-lang.org/docu/files/collections-api/collections_40.html
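
For example, one rough way to apply that advice to the code in the question (a sketch, assuming the keyedRDD above with its List[Record] values): aggregateByKey with a mutable ListBuffer appends each key's records into a buffer instead of rebuilding an immutable List on every merge.

import scala.collection.mutable.ListBuffer

val reduced = keyedRDD
  .aggregateByKey(ListBuffer.empty[Record])(
    (buf, records) => buf ++= records,  // fold each partition's List[Record] into the buffer
    (buf1, buf2) => buf1 ++= buf2       // merge buffers across partitions
  )
  .mapValues(_.toList)                  // back to an immutable List if needed downstream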
