
Reduce by key into list of tuples in Spark

I am trying to transpose my data so that it is a list of tuples for each key, instead of columns of data.

So as input I have:

1   234   54   7   9
2   654   34   2   1
1   987   22   4   6

and I want the following written to parquet files as output:

1:2   ((234, 54, 7, 9), (987, 22, 4, 6))
2:6   ((654, 34, 2, 1))

As input I have 2 sets of parquet data. I read them and join them as dataframes. Then I map each row to a key-value pair and use reduceByKey to combine each key's records into one big list of tuples.

// Key is "<id>:<bucket>", where bucket = second column / PARTITION_SIZE;
// each value is a single-element List so reduceByKey can concatenate lists.
val keyedRDD = joinedDF.map(row => (
  "" + row.getInt(0) + ":" + (row.getInt(1) / PARTITION_SIZE),
  List(Record(
    row.getInt(1),
    row.getInt(2),
    row.getInt(3),
    row.getInt(4)
  ))
))

val reduced = keyedRDD.reduceByKey(_ ::: _)

PARTITION_SIZE here is just a variable I set for each run to split the data into chunks of that size. For example, if I pass in 100000 and there are 2 million records, then records numbered 0-99,999 go into one bucket, 100,000-199,999 into the next, and so on.

Record is just a simple case class to hold this data. I've also tried plain tuples, and putting the values in a list by themselves, with the same results.
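The question doesn't show the case class, but a minimal sketch of what it might look like (the field names here are made up for illustration) is:

// Hypothetical holder for the four integer columns of each row
case class Record(col1: Int, col2: Int, col3: Int, col4: Int)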

It is my understanding that this should reduce to one list per key, as described above. However, I cannot get this job to finish. The Spark History Server always shows it hanging at the map stage (it doesn't even start it), even though Ganglia shows at least 80% CPU usage and high memory usage. The console gets stuck on these messages:

16/01/18 01:26:10 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 2 is 2485 bytes
16/01/18 01:26:10 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to ip-172-31-7-127.ec2.internal:34337
16/01/18 01:26:10 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 2 to ip-172-31-7-129.ec2.internal:45407
16/01/18 01:26:17 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to ip-172-31-7-128.ec2.internal:59468
16/01/18 01:26:17 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 1 is 75087 bytes
16/01/18 01:26:18 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to ip-172-31-7-127.ec2.internal:34337
16/01/18 01:26:18 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 1 to ip-172-31-7-129.ec2.internal:45407

One dataset is about 3GB and the other about 22GB, so really not big at all. But I thought maybe I was running out of memory (even though I don't get an OOM or executor-lost message until 20+ hours of being stuck). I've tried EMR clusters of m3.xlarge with 2 slave nodes, m3.xlarge with 6 slave nodes, and even r3.xlarge with 6 slave nodes, and still get the same problem. I have set up my EMR clusters to give Spark the maximum possible memory allocation, turned on Spark dynamic allocation, messed with the memoryFraction settings, etc.

I just can't figure out why it hangs where it does. I tried simplifying it to a (key, 1) RDD in the map with a plain sum in the reduce, and it finished in 20 minutes.
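For reference, a rough sketch of that simplified test, assuming the same key construction as above (this is a reconstruction, not the exact code that was run):

// Same key, constant value, plain sum; this version finishes quickly
val counted = joinedDF
  .map(row => ("" + row.getInt(0) + ":" + (row.getInt(1) / PARTITION_SIZE), 1))
  .reduceByKey(_ + _)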

Appending to a list is an expensive operation and a common mistake. Remember Scala's bias towards immutable objects. A good place to start is to google "Scala list append performance"; that will turn up several good blog posts that describe the problem and the recommendations in detail. In general, appending to an immutable list is expensive because each operation produces a new list, which is very compute- and memory-intensive. You can prepend to the list instead, or, often the best answer, use a ListBuffer. Again, look at the blogs and note the performance characteristics.
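One way to apply this here is aggregateByKey with a mutable ListBuffer instead of reduceByKey with list concatenation. This is only a sketch, assuming keyedRDD's values are the single-element List[Record] from the question:

import scala.collection.mutable.ListBuffer

// Accumulate each key's records in a ListBuffer (cheap appends),
// then materialize an immutable List once per key at the end.
val reduced = keyedRDD
  .aggregateByKey(ListBuffer.empty[Record])(
    (buf, recs) => buf ++= recs,   // fold values within a partition
    (b1, b2) => b1 ++= b2          // merge buffers across partitions
  )
  .mapValues(_.toList)

The idea is to avoid copying an ever-growing immutable list on every merge; the buffer is mutated in place and only converted to a List once per key.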

http://www.scala-lang.org/docu/files/collections-api/collections_40.html
