
How to reduce shuffling and time taken by Spark while making a map of items?

I am using Spark to read a csv file like this:

x, y, z
x, y
x
x, y, c, f
x, z

I want to make a map of items vs their count. This is the code I wrote:

import scala.collection.mutable
import scala.reflect.ClassTag
import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

private def genItemMap[Item: ClassTag](data: RDD[Array[Item]],
                                       partitioner: HashPartitioner): mutable.Map[Item, Long] = {
  // Flatten every row into individual items, count them, and pull the
  // result back to the driver as a mutable map.
  val immutableFreqItemsMap = data.flatMap(t => t)
    .map(v => (v, 1L))
    .reduceByKey(partitioner, _ + _)
    .collectAsMap()

  val freqItemsMap = mutable.Map(immutableFreqItemsMap.toSeq: _*)
  freqItemsMap
}

When I run it, it takes a lot of time and shuffle space. Is there a way to reduce the time?

I have a 2-node cluster with 2 cores each and 8 partitions. The csv file has 170,000 lines.

If you just want to do a unique item count, then I suppose you can take the following approach.

val data: RDD[Array[Item]] = ???

val itemFrequency = data
  .flatMap(arr =>
    arr.map(item => (item, 1))
  )
  .reduceByKey(_ + _)

Do not provide any partitioner while reducing, otherwise it will cause re-shuffling. Just keep the partitioning it already had.
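
As a rough illustration of the two call forms (a minimal sketch, not code from the answer; it assumes the same data RDD as above, and the partition count of 8 is only borrowed from the question's setup):

import org.apache.spark.HashPartitioner

// Passing an explicit partitioner asks Spark to hash the shuffle output
// into exactly 8 partitions:
val withExplicitPartitioner = data
  .flatMap(arr => arr.map(item => (item, 1)))
  .reduceByKey(new HashPartitioner(8), _ + _)

// Omitting it lets reduceByKey fall back to the default partitioner
// (derived from the parent RDD / spark.default.parallelism), so nothing
// beyond the single reduce shuffle is requested:
val withDefaultPartitioning = data
  .flatMap(arr => arr.map(item => (item, 1)))
  .reduceByKey(_ + _)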

Also... do not collect the distributed data into a local in-memory object like a Map.
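
If only a small slice of the counts is needed on the driver, or the result is only consumed by further jobs, something like the following avoids the collect entirely (a hedged sketch, assuming the itemFrequency RDD from above; the output path is purely hypothetical):

// Bring back only the 20 most frequent items instead of the whole map.
val top20 = itemFrequency.top(20)(Ordering.by(_._2))

// Or keep everything distributed and write it out for later jobs.
itemFrequency.saveAsTextFile("hdfs:///tmp/item-frequency")  // hypothetical path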
