
Efficient calculation of entropy in Spark

Given an RDD (data) and a list of index fields to calculate entropy on, executing the following flow takes approximately 5 s to compute a single entropy value on a 2 MB (16k-row) source.

def entropy(data: RDD[Array[String]], colIdx: Array[Int], count: Long): Double = {
  println(data.toDebugString)
  data.map(r => colIdx.map(idx => r(idx)).mkString(",") -> 1)
    .reduceByKey(_ + _)
    .map(v => {
      val p = v._2.toDouble / count
      -p * scala.math.log(p) / scala.math.log(2)
    })
    .reduce((v1, v2) => v1 + v2)
}

The output of toDebugString is as follows:

(entropy,MappedRDD[93] at map at Q.scala:31 (8 partitions)
  UnionRDD[72] at $plus$plus at S.scala:136 (8 partitions)
    MappedRDD[60] at map at S.scala:151 (4 partitions)
      FilteredRDD[59] at filter at S.scala:150 (4 partitions)
        MappedRDD[40] at map at S.scala:124 (4 partitions)
          MapPartitionsRDD[39] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
            FilteredRDD[27] at filter at S.scala:104 (4 partitions)
              MappedRDD[8] at map at X.scala:21 (4 partitions)
                MappedRDD[6] at map at R.scala:39 (4 partitions)
                  FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
                    HadoopRDD[4] at objectFile at F.scala:51 (4 partitions)
    MappedRDD[68] at map at S.scala:151 (4 partitions)
      FilteredRDD[67] at filter at S.scala:150 (4 partitions)
        MappedRDD[52] at map at S.scala:124 (4 partitions)
          MapPartitionsRDD[51] at mapPartitionsWithIndex at L.scala:356 (4 partitions)
            FilteredRDD[28] at filter at S.scala:105 (4 partitions)
              MappedRDD[8] at map at X.scala:21 (4 partitions)
                MappedRDD[6] at map at R.scala:39 (4 partitions)
                  FlatMappedRDD[5] at objectFile at F.scala:51 (4 partitions)
                    HadoopRDD[4] at objectFile at F.scala:51 (4 partitions),colIdex,13,count,3922)

If I collect the RDD and parallelize it again, the calculation takes ~150 ms (which still seems high for a simple 2 MB file), and this obviously becomes a challenge when dealing with multiple GB of data. What am I missing to properly utilize Spark and Scala?
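For reference, this is roughly what the collect-and-parallelize workaround looks like (sc being the SparkContext):

// Sketch of the workaround described above: materialize the RDD on the driver
// and re-parallelize it, which cuts off the long upstream lineage shown in the
// debug string. sc is assumed to be the active SparkContext.
val materialized = sc.parallelize(data.collect())
val h = entropy(materialized, colIdx, count)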

My original implementation (which performed even worse):

data.map(r => colIdx.map(idx => r(idx)).mkString(","))
  .groupBy(r => r)
  .map(g => g._2.size)
  .map(v => v.toDouble / count)
  .map(v => -v * scala.math.log(v) / scala.math.log(2))
  .reduce((v1, v2) => v1 + v2)

Firstly, it looks like there is a bug in your code: you need to handle a p of 0, so -p * math.log(p) / math.log(2) should be if (p == 0.0) 0.0 else -p * math.log(p) / math.log(2).

Secondly, you can use base e; you don't really need a base of 2, since changing the base only rescales the result by a constant factor.
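Putting those two points together, a minimal sketch of the fixed function might look like the following (entropyFixed is just an illustrative name; it converts back to base 2 with a single division at the end):

import org.apache.spark.rdd.RDD

def entropyFixed(data: RDD[Array[String]], colIdx: Array[Int], count: Long): Double =
  data.map(r => colIdx.map(idx => r(idx)).mkString(",") -> 1L)
    .reduceByKey(_ + _)
    .map { case (_, n) =>
      val p = n.toDouble / count
      // Guard against log(0), and use the natural log throughout;
      // the base only changes the result by a constant factor.
      if (p == 0.0) 0.0 else -p * math.log(p)
    }
    .reduce(_ + _) / math.log(2)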

Anyway, the likely reason your code is slow is that it has so few partitions. You should have at least 2-4 partitions per CPU; in practice I often use more. How many CPUs do you have?
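As a rough sketch (the multiplier of 4 and the use of the driver's core count are only illustrative choices, not tuned values):

// Check how many partitions the data currently has, and spread it across
// more of them before the shuffle. The numbers here are only illustrative.
println(s"current partitions: ${data.partitions.length}")
val numPartitions = 4 * Runtime.getRuntime.availableProcessors()
val spread = data.repartition(numPartitions)
// reduceByKey also accepts a partition count directly:
//   .reduceByKey(_ + _, numPartitions)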

Now, what is probably taking the most time is not the entropy calculation itself, which is trivial, but the reduceByKey, which is being done on String keys. Is it possible to use some other data type? What actually is colIdx? What exactly is r?
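For illustration only, one alternative to building a String key per row is to hash the selected column values into an Int (for example with MurmurHash3), at the cost of a small collision risk:

import scala.util.hashing.MurmurHash3

// Illustrative sketch: replace the comma-joined String key with an Int hash.
// Distinct keys can in principle collide, which would slightly skew the counts.
val counts = data
  .map(r => MurmurHash3.arrayHash(colIdx.map(idx => r(idx))) -> 1L)
  .reduceByKey(_ + _)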

One final observation: you're indexing each record many times with colIdx.map(r.apply). This will be very slow if r is not of type Array or IndexedSeq; if it's a List, each lookup is O(index) because you have to traverse the list to reach the element you want.
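A quick sketch of that last point, for the hypothetical case where the rows arrive as List[String] rather than Array[String]:

import org.apache.spark.rdd.RDD

// Hypothetical: if each row were a List[String] (linear-time indexing), convert
// it to an Array once per record, then do the cheap O(1) lookups.
def projectKeys(rows: RDD[List[String]], colIdx: Array[Int]): RDD[String] =
  rows.map { r =>
    val indexed = r.toArray   // one O(n) conversion instead of one traversal per index
    colIdx.map(idx => indexed(idx)).mkString(",")
  }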
