
Converting Scala OOP code to Functional code

I'm following some Spark tutorials and, to my dismay, it appears the lecturer is not following Scala best practices. He uses var and state changes, whereas I would prefer to work with immutable data, so I changed my code to better fit the functional paradigm. However, as I'm still new to the language, I'm not 100% sure my code is equivalent.

The code is essentially supposed to loop from 1 through 10. Every iteration it performs a .flatMap() operation on an RDD and checks whether a LongAccumulator has reached a certain value. If it has, the loop breaks; otherwise it executes .reduceByKey() on the RDD and begins the next iteration.
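
The same control flow can also be expressed without var or break by recursing on the RDD. The sketch below is only an outline: it assumes the Node type, the mapBfs and reduceBfs helpers, and the accumulator that appear later in this post, and the name bfsStep is mine, not the tutorial's.

import org.apache.spark.rdd.RDD
import org.apache.spark.util.LongAccumulator
import scala.annotation.tailrec

// One recursive step per BFS iteration: map, force evaluation, check the
// accumulator, then either stop or reduce and recurse.
@tailrec
def bfsStep(rdd: RDD[Node], iteration: Int, counter: LongAccumulator): RDD[Node] =
  if (iteration > 10) rdd
  else {
    val mapped = rdd.flatMap(x => mapBfs(x, 14, counter))
    println("Processing " + mapped.count() + " values") // count() is the action that updates the accumulator
    if (counter.value > 0) mapped                        // target found: stop iterating
    else bfsStep(mapped.reduceByKey(reduceBfs), iteration + 1, counter)
  }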

A few things I'm not sure of:

  1. Whether my implementation of the LongAccumulator is correct (I try incrementing it by passing it into a function): see my map function below.
  2. Whether I have to store the result of a transformation such as .flatMap() on an RDD in a new val, or whether simply calling rdd.flatMap() stores the result at a new memory address and points the pre-existing val rdd to it (similar to the way immutable Vectors work): see my for loop below and the sketch after this list.
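
On point 2: RDD transformations never modify the RDD they are called on. They return a new, lazily evaluated RDD, so a bare rdd.flatMap(...) whose result is neither assigned nor chained is simply discarded. A minimal sketch, using the sparkContext created in the initialisation below (the values are made up, not from the tutorial):

val numbers = sparkContext.parallelize(Vector(1, 2, 3))

numbers.map(_ * 2)               // the result is thrown away; numbers itself is unchanged
val doubled = numbers.map(_ * 2) // a new RDD; nothing is computed yet (lazy)
println(doubled.count())         // the action triggers the computation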

For those wondering, this is an implementation of a breadth-first search algorithm on a graph constructed from a .txt file. Where the lecturer has opted to use colors, I used the numbers 1 to 3 (1 for unvisited, 2 for the gray frontier, 3 for processed) to simplify code later on. I've also chosen to use Vectors over Arrays.

Initialising variables:

//My code, outside main class


type NodeMetaData = (Vector[Int], Int, Int)
type Node = (Int, NodeMetaData)

//Within main class


val sparkContext = new SparkContext("local[*]", "TestContext")

val hitCounter = sparkContext.longAccumulator("counter")



//Tutorial code, outside main class


// We make our accumulator a "global" Option so we can reference it in a mapper later.
var hitCounter:Option[LongAccumulator] = None

// Some custom data types 
// BFSData contains an array of hero ID connections, the distance, and color.
type BFSData = (Array[Int], Int, String)
// A BFSNode has a heroID and the BFSData associated with it.
type BFSNode = (Int, BFSData)


//Within main class


val sc = new SparkContext("local[*]", "DegreesOfSeparation") 

// Our accumulator, used to signal when we find the target 
// character in our BFS traversal.
hitCounter = Some(sc.longAccumulator("Hit Counter"))
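
As a side note on the accumulator question (point 1 above): passing a registered LongAccumulator into a function used by a transformation works; the executors call add() and the driver reads value once an action has run. A minimal sketch reusing the sparkContext from my initialisation above (flagTarget and ids are hypothetical names, not from the tutorial):

val counter = sparkContext.longAccumulator("target hits")

def flagTarget(x: Int, targetId: Int, acc: org.apache.spark.util.LongAccumulator): Int = {
  if (x == targetId) acc.add(1) // runs on the executors
  x
}

val ids = sparkContext.parallelize(Vector(3, 14, 27, 14))
val flagged = ids.map(x => flagTarget(x, 14, counter)) // lazy: counter is still 0 here
flagged.count()                                        // the action forces the map to run
println("Target seen " + counter.value + " time(s)")   // read on the driver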

My for loop:

val normalisedValues = graphText.map(x => parseLine(x, 5306)) //rdd
breakable {
      for (n <- 1 to 10) {
        println("Running bfs iteration#"+n)
        normalisedValues.flatMap(x => mapBfs(x, 14, hitCounter))
        println("Processing " + normalisedValues.count() + " values\n")

        if (hitCounter.isRegistered) {
          if (hitCounter.value > 0) {
            println("Found target from " + hitCounter.value + "different directions")
            break
          } else {
            normalisedValues.reduceByKey(reduceBfs)
          }
        }
      }
    }

Tutorial for loop:

var iterationRdd = createStartingRdd(sc)
var iteration:Int = 0
    for (iteration <- 1 to 10) {
      println("Running BFS Iteration# " + iteration)

      // Create new vertices as needed to darken or reduce distances in the
      // reduce stage. If we encounter the node we're looking for as a GRAY
      // node, increment our accumulator to signal that we're done.
      val mapped = iterationRdd.flatMap(bfsMap)

      // Note that mapped.count() action here forces the RDD to be evaluated, and
      // that's the only reason our accumulator is actually updated.  
      println("Processing " + mapped.count() + " values.")

      if (hitCounter.isDefined) {
        val hitCount = hitCounter.get.value
        if (hitCount > 0) {
          println("Hit the target character! From " + hitCount + 
              " different direction(s).")
          return
        }
      }

      // Reducer combines data for each character ID, preserving the darkest
      // color and shortest path.      
      iterationRdd = mapped.reduceByKey(bfsReduce)
    }
  }

My map function:

def mapBfs(node:Node, targetId:Int, counter: LongAccumulator): Vector[Node] = {
    val relations = node._2
    val results:Vector[Node] = Vector.empty[Node]
    if(relations._2 == 2) {
      relations._1.foreach(x => {
        if (x == targetId) {
          if (counter.isRegistered) counter.add(1) else None
        }
        results :+ (x, (Vector(), counter.value.toInt, 2))

      })

    }

    results :+ (node._1, (relations._1, relations._2, if(relations._2 == 2) 3 else 2))
  }

Tutorial map function:

/** Expands a BFSNode into this node and its children */
  def bfsMap(node:BFSNode): Array[BFSNode] = {

    // Extract data from the BFSNode
    val characterID:Int = node._1
    val data:BFSData = node._2

    val connections:Array[Int] = data._1
    val distance:Int = data._2
    var color:String = data._3

    // This is called from flatMap, so we return an array
    // of potentially many BFSNodes to add to our new RDD
    var results:ArrayBuffer[BFSNode] = ArrayBuffer()

    // Gray nodes are flagged for expansion, and create new
    // gray nodes for each connection
    if (color == "GRAY") {
      for (connection <- connections) {
        val newCharacterID = connection
        val newDistance = distance + 1
        val newColor = "GRAY"

        // Have we stumbled across the character we're looking for?
        // If so increment our accumulator so the driver script knows.
        if (targetCharacterID == connection) {
          if (hitCounter.isDefined) {
            hitCounter.get.add(1)
          }
        }

        // Create our new Gray node for this connection and add it to the results
        val newEntry:BFSNode = (newCharacterID, (Array(), newDistance, newColor))
        results += newEntry
      }

      // Color this node as black, indicating it has been processed already.
      color = "BLACK"
    }

    // Add the original node back in, so its connections can get merged with 
    // the gray nodes in the reducer.
    val thisEntry:BFSNode = (characterID, (connections, distance, color))
    results += thisEntry

    return results.toArray
  }

Solution:

type NodeMetaData = (Vector[Int], Int, Int)
  type Node = (Int, NodeMetaData)



  def main(args: Array[String]): Unit = {

    Logger.getLogger("org").setLevel(Level.ERROR)

    val startID:Int = 5306 //Spiderman
    val targetID:Int = 14 //Adam 3031

    val sparkContext = new SparkContext("local[*]", "TestContext")

    val hitCounter = sparkContext.longAccumulator("counter")

    val nameText = sparkContext.textFile("SparkScala/Marvel-names.txt")
    val nameLookup = nameText.flatMap(mapName)
    val graphText = sparkContext.textFile("SparkScala/Marvel-graph.txt")

    var normalisedValues = graphText.map(x => createNode(x, startID)).reduceByKey((x,y) => (x._1.union(y._1),x._2,x._3))

    for(n <- 1 to 10){
      val searchables: Vector[Int] = normalisedValues.filter(_._2._3 == 3).reduce((x,y) => (0,(x._2._1.union(y._2._1),0,0)))._2._1
      val broadcast = sparkContext.broadcast(searchables)
      val currentDistance = sparkContext.broadcast(n)
      normalisedValues = normalisedValues.map(x => if(broadcast.value.contains(x._1) && x._2._3 != 3) markNode(x, currentDistance.value) else x).map(conquerNode)
    }

    val collection = normalisedValues.collect()

    collection.sortWith((x,y) => x._2._2 < y._2._2 ).foreach(x =>{
      if(x._1 == targetID) {
        println(f"${nameLookup.lookup(targetID).head} is ${x._2._2} degrees away from ${nameLookup.lookup(startID).head}")
      }
    })
  }


  def mapName(line:String): Option[(Int, String)]={
    val values = line.split(" ", 2)
    val name = values(1).replace("\"", "")
    Some(values(0).toInt, name)
  }

  def createNode(line: String, startId:Int): Node ={
    val connections = line.split("\\s+")
    val id = connections(0).toInt
    val relations = connections.map(_.toInt).filter( _ != id ).toVector
    val color = if ( id == startId ) 3 else 1
    val distance = if ( id == startId )  0 else 9999
    (id, ( relations, distance, color ))

  }

  def markNode(node:Node, distance:Int): Node ={
    (node._1,(node._2._1, distance, 2))
  }


  def conquerNode(node:Node): Node = {
    if(node._2._3 == 2) {
      (node._1, (node._2._1, node._2._2, 3))
    } else { node }
  }

The solution was to create two separate functions: one that marks a node as gray (color == 2) and another that 'conquers' the node by marking it black (color == 3). To figure out which nodes to mark as gray next, the union of the connection Vectors of all previously marked black nodes is broadcast at the start of every iteration.
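
For example, on a hypothetical three-node graph (not from the real data set) where node 1 is the start and is connected to node 2, which is connected to node 3: after createNode, node 1 is (Vector(2), 0, 3) and the other two are unvisited. The first iteration broadcasts Vector(2), so node 2 is marked gray with distance 1 and immediately conquered to black; the second iteration's broadcast then contains node 3's ID, so node 3 ends up black with distance 2.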

One downside of this implementation is the use of var normalisedValues. This could have been avoided by appending each new RDD to the tail of a Vector and simply reading the tail at the start of the next iteration, but I did not want to do that, as storing every RDD would have resulted in significant memory overhead. For demonstration purposes, however, this will do. Please refer to

https://raw.githubusercontent.com/maciejmarczak/spark-exercises/master/src/main/scala/org/maciejmarczak/scala/DegreesOfSeparation.scala

to see the original code.
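
For illustration only, the var could also be avoided by threading the RDD through a foldLeft over the iteration numbers. This is a rough sketch reusing the markNode and conquerNode helpers from the solution above (normalisedValues would then be a val); it is not code from the linked original:

val finalValues = (1 to 10).foldLeft(normalisedValues) { (rdd, n) =>
  // Broadcast the connections of all nodes already marked black (color 3)
  val searchables: Vector[Int] = rdd.filter(_._2._3 == 3)
    .reduce((x, y) => (0, (x._2._1.union(y._2._1), 0, 0)))._2._1
  val frontier = sparkContext.broadcast(searchables)
  // Mark reachable, not-yet-black nodes gray with the current distance, then conquer them
  rdd.map(x => if (frontier.value.contains(x._1) && x._2._3 != 3) markNode(x, n) else x)
     .map(conquerNode)
}

finalValues would then take the place of normalisedValues in the collect() call.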
