
How to reduce shuffling and time taken by Spark while making a map of items?

I am using Spark to read a csv file like this:

x, y, z
x, y
x
x, y, c, f
x, z

I want to make a map of items vs their count. This is the code I wrote:

import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

private def genItemMap[Item: ClassTag](data: RDD[Array[Item]],
                                       partitioner: HashPartitioner): mutable.Map[Item, Long] = {
  // Flatten every row into individual items, count them, and pull the
  // result back to the driver as a mutable map.
  val immutableFreqItemsMap = data.flatMap(t => t)
    .map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .collectAsMap()

  val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
  freqItemsMap
}

When I run it, it takes a lot of time and shuffle space. Is there a way to reduce the time?

I have a 2-node cluster with 2 cores each and 8 partitions. The csv file has 170,000 lines.

If you just want to do a unique item count, then I suppose you can take the following approach.

val data: RDD[Array[Item]] = ???

val itemFrequency = data
  .flatMap(arr =>
    arr.map(item => (item, 1))
  )
  .reduceByKey(_ + _)

Do not provide any partitioner while reducing, otherwise it will cause re-shuffling. Just keep the partitioning it already had.
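
As a rough illustration of the two call forms (a minimal sketch, not code from the answer; it assumes the same data RDD as above, and the partition count of 8 is only borrowed from the question's setup):

import org.apache.spark.HashPartitioner

// Passing an explicit partitioner asks Spark to hash the shuffle output
// into exactly 8 partitions:
val withExplicitPartitioner = data
  .flatMap(arr => arr.map(item => (item, 1)))
  .reduceByKey(new HashPartitioner(8), _ + _)

// Omitting it lets reduceByKey fall back to the default partitioner
// (derived from the parent RDD / spark.default.parallelism), so nothing
// beyond the single reduce shuffle is requested:
val withDefaultPartitioning = data
  .flatMap(arr => arr.map(item => (item, 1)))
  .reduceByKey(_ + _)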

Also... do not collect the distributed data into a local in-memory object like a Map.
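
If only a small slice of the counts is needed on the driver, or the result is only consumed by further jobs, something like the following avoids the collect entirely (a hedged sketch, assuming the itemFrequency RDD from above; the output path is purely hypothetical):

// Bring back only the 20 most frequent items instead of the whole map.
val top20 = itemFrequency.top(20)(Ordering.by(_._2))

// Or keep everything distributed and write it out for later jobs.
itemFrequency.saveAsTextFile("hdfs:///tmp/item-frequency")  // hypothetical path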
