
Efficient calculation of entropy in Spark

Given an RDD (data) and a list of index fields to calculate entropy on: when executing the following flow, it takes approximately 5 s to calculate a single entropy value on a 2 MB (16k-row) source.

def entropy(data: RDD[Array[String]], colIdx: Array[Int], count: Long): Double = {
  println(data.toDebugString)
  data.map(r => colIdx.map(idx => r(idx)).mkString(",") -> 1)
    .reduceByKey(_ + _)
    .map(v => {
      val p = v._2.toDouble / count
      -p * scala.math.log(p) / scala.math.log(2)
    })
    .reduce((v1, v2) => v1 + v2)
}

The output of toDebugString is as follows:

(entropy,MappedRDD[93] at map at Q.scala:31 (8 partitions)
  UnionRDD[72] at $plus$plus at S.scala:136 (8 partitions)
    MappedRDD[60] at map at S.scala:151 (4 partitions)
      FilteredRDD[59] at filter at S.scala:150 (4 partitions)
        MappedRDD[40] at map at S.scala:124 (4 partitions)
          MapPartitionsRDD[39] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
            FilteredRDD[27] at filter at S.scala:104 (4 partitions)
              MappedRDD[8] at map at X.scala:21 (4 partitions)
                MappedRDD[6] at map at R.scala:39 (4 partitions)
                  FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
                    HadoopRDD[4] at objectFile at F.scala:51 (4 partitions)
    MappedRDD[68] at map at S.scala:151 (4 partitions)
      FilteredRDD[67] at filter at S.scala:150 (4 partitions)
        MappedRDD[52] at map at S.scala:124 (4 partitions)
          MapPartitionsRDD[51] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
            FilteredRDD[28] at filter at S.scala:105 (4 partitions)
              MappedRDD[8] at map at X.scala:21 (4 partitions)
                MappedRDD[6] at map at R.scala:39 (4 partitions)
                  FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
                    HadoopRDD[4] at objectFile at F.scala:51 (4 partitions),colIdex,13,count,3922)

If I collect the RDD and parallelize it again, the calculation takes ~150 ms (which still seems high for a simple 2 MB file), and this obviously becomes a challenge when dealing with multiple GB of data. What am I missing to properly utilize Spark and Scala?
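For reference, a minimal sketch of that collect-and-reparallelize workaround (assuming sc is the SparkContext in scope; only viable while the data fits in driver memory):

// Materialize the long lineage on the driver, then redistribute as a fresh
// RDD with no upstream dependencies. data.cache() would avoid the driver
// round-trip entirely.
val rows = data.collect()
val flat = sc.parallelize(rows, 8)
val e = entropy(flat, colIdx, count)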

My original implementation (which performed even worse):

data.map(r => colIdx.map(idx => r(idx)).mkString(","))
  .groupBy(r => r)
  .map(g => g._2.size)
  .map(v => v.toDouble / count)
  .map(v => -v * scala.math.log(v) / scala.math.log(2))
  .reduce((v1, v2) => v1 + v2)

Firstly, it looks like there is a bug in your code: you need to handle a p of 0, so -p * math.log(p) / math.log(2) should be if (p == 0.0) 0.0 else -p * math.log(p) / math.log(2).

Secondly, you can use base e; you don't really need base 2 (the entropy then comes out in nats rather than bits).
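Putting both points together, a sketch of the adjusted function (divide the result by math.log(2) if you specifically need bits):

import org.apache.spark.rdd.RDD

def entropy(data: RDD[Array[String]], colIdx: Array[Int], count: Long): Double =
  data.map(r => colIdx.map(idx => r(idx)).mkString(",") -> 1)
    .reduceByKey(_ + _)
    .map { case (_, n) =>
      val p = n.toDouble / count
      // guard the p == 0 case and use the natural log (entropy in nats)
      if (p == 0.0) 0.0 else -p * math.log(p)
    }
    .reduce(_ + _)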

Anyway, the reason your code is slow is likely that you have so few partitions. You should have at least 2-4 partitions per CPU, and in practice I often use more. How many CPUs do you have?
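For example (the counts here assume an 8-core machine; scale to your own cluster):

// Repartition up front; reduceByKey will inherit the higher parallelism.
val repartitioned = data.repartition(32)

// Alternatively, pass the partition count straight to the shuffle:
// .reduceByKey(_ + _, 32)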

Now, what is probably taking the longest is not the entropy calculation, as that is trivial, but the reduceByKey, which is being done on String keys. Is it possible to use some other data type? What actually is colIdx? What exactly is r?
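One hypothetical direction, assuming the selected columns are low-cardinality categoricals: dictionary-encode each column and pack the codes into a single Long key, which is far cheaper to hash, compare, and serialize than a concatenated String (illustrative sketch only, supporting up to 3 columns with fewer than 2^21 distinct values each):

// Build one dictionary per selected column (collected to the driver;
// a real job would broadcast these instead of capturing them in the closure).
val dicts: Array[Map[String, Long]] = colIdx.map { idx =>
  data.map(_(idx)).distinct().collect()
    .zipWithIndex.map { case (v, i) => v -> i.toLong }.toMap
}

// Pack the per-column codes into one Long shuffle key.
val counts = data.map { r =>
  val key = colIdx.indices.foldLeft(0L) { (acc, i) =>
    (acc << 21) | dicts(i)(r(colIdx(i)))
  }
  key -> 1
}.reduceByKey(_ + _)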

One final observation: you are indexing each record many times with this colIdx.map(r.apply). You know this will be very slow if r is not of type Array or IndexedSeq; if it is a List, each lookup will be O(index), as you have to traverse the list to get to the index you want.
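For illustration, assuming a hypothetical listData: RDD[List[String]], converting each row once up front restores O(1) lookups:

// Convert each row to an Array once, so every index access afterwards is O(1).
val fastData: RDD[Array[String]] = listData.map(_.toArray)
fastData.map(r => colIdx.map(idx => r(idx)).mkString(",") -> 1)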
