
How to Reduce by key in “Scala” [Not In Spark]

I am trying to reduce by key in Scala. Is there any method to reduce values based on their keys in plain Scala? [I know we can do this with the reduceByKey method in Spark, but how do we do the same in Scala?]

The input Data is :

import scala.io.Source

val File = Source.fromFile("C:/Users/svk12/git/data/retail_db/order_items/part-00000")
                 .getLines()
                 .toList

val map = File.map(x => x.split(","))
              .map(x => (x(1), x(4)))

map.take(10).foreach(println)

After the above step I am getting this result:

(2,250.0)
(2,129.99)
(4,49.98)
(4,299.95)
(4,150.0)
(4,199.92)
(5,299.98)
(5,299.95)

Expected Result :

(2,379.99)
(5,599.93)
.......

It looks like you want the sum of some values from a file. One problem is that what you read from a file are Strings, so you have to convert each String to a numeric type before it can be summed.

These are the steps you might use.

io.Source.fromFile("so.txt") //open file
  .getLines()                //read line-by-line
  .map(_.split(","))         //each line is Array[String]
  .toSeq                     //to something that can groupBy()
  .groupBy(_(1))             //now is Map[String,Seq[Array[String]]]
  .mapValues(_.map(_(4).toDouble).sum) //now is Map[String,Double]
  .toSeq                     //un-Map it to (String,Double) tuples
  .sorted                    //presentation order
  .take(10)                  //sample
  .foreach(println)          //report

This will, of course, throw if any file data is not in the required format.
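
If throwing on bad input is unacceptable, one option is to drop malformed rows instead. A minimal sketch, assuming Scala 2.13+ (for toDoubleOption) and the same hypothetical file name:

io.Source.fromFile("so.txt")
  .getLines()
  .map(_.split(","))
  .flatMap { cols =>
    if (cols.length > 4) cols(4).toDoubleOption.map(v => (cols(1), v))
    else None                          //drop rows that are too short or not numeric
  }
  .toSeq
  .groupBy(_._1)                       //Map[String, Seq[(String, Double)]]
  .view.mapValues(_.map(_._2).sum)     //sum the Doubles per key
  .toSeq
  .sorted
  .foreach(println)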

Starting with Scala 2.13, you can use the groupMapReduce method, which is (as its name suggests) equivalent to a groupBy followed by mapValues and a reduce step:

io.Source.fromFile("file.txt")
  .getLines.to(LazyList)
  .map(_.split(','))
  .groupMapReduce(_(1))(_(4).toDouble)(_ + _)

The groupMapReduce stage:

  • groups the split arrays by their 2nd element ( _(1) ) (group part of groupMapReduce)

  • maps each array within a group to its 5th element, converted to Double ( _(4).toDouble ) (map part of groupMapReduce)

  • reduces the values within each group ( _ + _ ) by summing them (reduce part of groupMapReduce).

This is a one-pass version of what could otherwise be written as:

seq.groupBy(_(1)).mapValues(_.map(_(4).toDouble).reduce(_ + _))

Also note the conversion from Iterator to LazyList in order to use a collection which provides groupMapReduce (we don't use a Stream, since as of Scala 2.13, LazyList is the recommended replacement for Stream).
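
For illustration, here is the same groupMapReduce call run on a few inlined rows shaped like the question's data (the rows themselves are made up), so it can be tried without the file:

// Hypothetical rows in the question's format: the order id is at index 1,
// the subtotal at index 4.
val lines = LazyList(
  "1,2,101,1,250.0,250.0",
  "2,2,102,1,129.99,129.99",
  "3,4,103,2,49.98,24.99"
)

val sums: Map[String, Double] =
  lines
    .map(_.split(','))
    .groupMapReduce(_(1))(_(4).toDouble)(_ + _)

println(sums) // e.g. Map(2 -> 379.99, 4 -> 49.98)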

There is nothing built-in, but you can write it like this:

def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  var result = Map.empty[A, B]
  items.foreach {
    case (a, b) =>
      // combine with the value already stored for this key, or store b if the key is new
      result += (a -> result.get(a).map(b1 => f(b1, b)).getOrElse(b))
  }
  result
}

There is some room to optimize this (e.g. use a mutable map), but the general idea remains the same.
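
As a rough sketch of that optimization (the name reduceByKeyMutable is mine; updateWith assumes Scala 2.13+):

import scala.collection.mutable

// Same contract as reduceByKey above, but accumulates into a single
// mutable map and converts it to an immutable one at the end.
def reduceByKeyMutable[A, B](items: Iterable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  val acc = mutable.Map.empty[A, B]
  items.foreach { case (a, b) =>
    acc.updateWith(a) {
      case Some(b1) => Some(f(b1, b)) // key already seen: combine
      case None     => Some(b)        // first occurrence: store as-is
    }
  }
  acc.toMap
}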

Another approach is more declarative but less efficient (it creates several intermediate collections; it can be rewritten, but at the cost of clarity):

def reduceByKey[A, B](items: Traversable[(A, B)])(f: (B, B) => B): Map[A, B] = {
  items
    .groupBy { case (a, _) => a }
    .mapValues(_.map { case (_, b) => b }.reduce(f))
    // mapValues returns a view, view.force changes it back to a realized map
    .view.force
}
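
Either version can then be applied to pairs like the ones in the question, for example:

// Pairs copied from the question's intermediate output.
val pairs = List(
  ("2", 250.0), ("2", 129.99),
  ("4", 49.98), ("4", 299.95), ("4", 150.0), ("4", 199.92),
  ("5", 299.98), ("5", 299.95)
)

val totals = reduceByKey(pairs)(_ + _)
totals.toSeq.sortBy(_._1).foreach(println)
// (2,379.99), (4,699.85), (5,599.93) -- modulo floating-point rounding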

First group the tuples by key (the first element here), then reduce each group. The following code will work (note the toDouble: the values read from the file are Strings):

val reducedList = map.groupBy(_._1).map(l => (l._1, l._2.map(_._2.toDouble).reduce(_ + _)))
println(reducedList)

Here is another solution, using foldLeft:

val File: List[String] = ???

File.map(x => x.split(","))
  .map(x => (x(1), x(4).toDouble))
  .foldLeft(Map.empty[String, Double]) { case (state, (key, value)) =>
    state.updated(key, state.getOrElse(key, 0.0) + value)
  }
  .toSeq
  .sortBy(_._1)
  .take(10)
  .foreach(println)
