
Apache Spark groupByKey alternative

My table has the following columns: [col1, col2, key1, col3, txn_id, dw_last_updated]. Of these, txn_id and key1 are the primary key columns. My dataset can contain multiple records for the same (txn_id, key1) combination, and out of those records I need to pick the latest one based on dw_last_updated.

I'm using the logic below. I'm consistently hitting memory issues, and I believe it's partly because of groupByKey(). Is there a better alternative?

case class Fact(col1: Int,
  col2: Int,
  key1: String,
  col3: Int,
  txn_id: Double,
  dw_last_updated: Long)

sc.textFile(s3path)
  .map { row =>
    val parts = row.split("\t")
    Fact(parts(0).toInt,
      parts(1).toInt,
      parts(2),
      parts(3).toInt,
      parts(4).toDouble,
      parts(5).toLong)
  }
  .map { t => ((t.txn_id, t.key1), t) }
  .groupByKey(512)
  .map { case ((txn_id, key1), sequence) =>
    // keep only the latest record (by dw_last_updated) per (txn_id, key1)
    val newrecord = sequence.maxBy(_.dw_last_updated)
    newrecord.col1 + "\t" + newrecord.col2 + "\t" + newrecord.key1 +
      "\t" + newrecord.col3 + "\t" + newrecord.txn_id + "\t" + newrecord.dw_last_updated
  }

Appreciate your thoughts / suggestions...

rdd.groupByKey collects all values per key, which requires enough memory on a single node to hold the full sequence of values for a key. Its use is discouraged. See [1].

Given that we are interested in only one value per key, the record with max(dw_last_updated), a more memory-efficient approach is rdd.reduceByKey, where the reduce function picks the max of two records for the same key, using the timestamp as the discriminant:

rdd.reduceByKey { case (record1, record2) => max(record1, record2) }

Applied to your case, it should look like this:

case class Fact(...)

object Fact {
  def parse(s: String): Fact = ???
  def maxByTs(f1: Fact, f2: Fact): Fact =
    if (f1.dw_last_updated > f2.dw_last_updated) f1 else f2
}

val factById = sc.textFile(s3path).map { row =>
  val fact = Fact.parse(row)
  ((fact.txn_id, fact.key1), fact)
}
val maxFactById = factById.reduceByKey(Fact.maxByTs)

Note that I've defined the utility operations on the Fact companion object to keep the code tidy. I also advise giving a named variable to each transformation step, or logical group of steps; it makes the program more readable.
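If it helps, here is a minimal sketch of what the completed companion object could look like. Fact.parse is filled in using the tab-split logic from your question; the toTsv helper and the s3outpath output location are hypothetical names I've introduced for writing the de-duplicated records back out in your original tab-separated layout:

object Fact {
  // One possible parse implementation, reusing the question's split logic.
  def parse(s: String): Fact = {
    val parts = s.split("\t")
    Fact(parts(0).toInt, parts(1).toInt, parts(2),
      parts(3).toInt, parts(4).toDouble, parts(5).toLong)
  }

  // Keep whichever of the two records has the later timestamp.
  def maxByTs(f1: Fact, f2: Fact): Fact =
    if (f1.dw_last_updated > f2.dw_last_updated) f1 else f2

  // Serialize a Fact back to the input's tab-separated layout.
  def toTsv(f: Fact): String =
    Seq(f.col1, f.col2, f.key1, f.col3, f.txn_id, f.dw_last_updated).mkString("\t")
}

// Drop the composite key and write the surviving records out;
// s3outpath is a placeholder for wherever the output should go.
maxFactById.values.map(Fact.toTsv).saveAsTextFile(s3outpath)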
