
How to transform Scala nested map operation to Scala Spark operation?

The code below calculates the Euclidean distance between two lists in a dataset:

val user1 = List("a", "1", "3", "2", "6", "9")  //> user1  : List[String] = List(a, 1, 3, 2, 6, 9)
val user2 = List("b", "1", "2", "2", "5", "9")  //> user2  : List[String] = List(b, 1, 2, 2, 5, 9)

val all = List(user1, user2)                    //> all  : List[List[String]] = List(List(a, 1, 3, 2, 6, 9), List(b, 1, 2, 2, 5, 9))



def euclDistance(userA: List[String], userB: List[String]) = {
  println("comparing " + userA(0) + " and " + userB(0))
  val zipped = userA.zip(userB)
  // drop the head pair (the user names); assumes the lists are non-empty
  val lastElements = zipped match {
    case h :: t => t
  }
  val subElements = lastElements.map(m => (m._1.toDouble - m._2.toDouble) * (m._1.toDouble - m._2.toDouble))
  val summed = subElements.sum
  val sqRoot = Math.sqrt(summed)

  sqRoot
}                                               //> euclDistance: (userA: List[String], userB: List[String])Double

all.map(m => all.map(m2 => euclDistance(m, m2)))
                                                //> comparing a and a
                                                //| comparing a and b
                                                //| comparing b and a
                                                //| comparing b and b
                                                //| res0: List[List[Double]] = List(List(0.0, 1.4142135623730951), List(1.4142135623730951, 0.0))

But how can this be translated into a parallel Spark operation?
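(A minimal sketch of a direct translation, assuming a SparkContext named sc. distAll is assumed here to be just the parallelized input, which matches the ParallelCollectionRDD in the log further down, and the cartesian step mirrors the nested map above:)

val distAll = sc.parallelize(all)
val distRdd = distAll.cartesian(distAll).map {
  case (userA, userB) => euclDistance(userA, userB)
}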

When I print the contents of distAll:

scala> distAll.foreach(p => p.foreach(println))
14/10/24 23:09:42 INFO SparkContext: Starting job: foreach at <console>:21
14/10/24 23:09:42 INFO DAGScheduler: Got job 2 (foreach at <console>:21) with 4 output partitions (allowLocal=false)
14/10/24 23:09:42 INFO DAGScheduler: Final stage: Stage 2(foreach at <console>:21)
14/10/24 23:09:42 INFO DAGScheduler: Parents of final stage: List()
14/10/24 23:09:42 INFO DAGScheduler: Missing parents: List()
14/10/24 23:09:42 INFO DAGScheduler: Submitting Stage 2 (ParallelCollectionRDD[1] at parallelize at <console>:18), which has no missing parents
14/10/24 23:09:42 INFO MemoryStore: ensureFreeSpace(1152) called with curMem=1152, maxMem=278019440
14/10/24 23:09:42 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 1152.0 B, free 265.1 MB)
14/10/24 23:09:42 INFO DAGScheduler: Submitting 4 missing tasks from Stage 2 (ParallelCollectionRDD[1] at parallelize at <console>:18)
14/10/24 23:09:42 INFO TaskSchedulerImpl: Adding task set 2.0 with 4 tasks
14/10/24 23:09:42 INFO TaskSetManager: Starting task 0.0 in stage 2.0 (TID 8, localhost, PROCESS_LOCAL, 1169 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 1.0 in stage 2.0 (TID 9, localhost, PROCESS_LOCAL, 1419 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 2.0 in stage 2.0 (TID 10, localhost, PROCESS_LOCAL, 1169 bytes)
14/10/24 23:09:42 INFO TaskSetManager: Starting task 3.0 in stage 2.0 (TID 11, localhost, PROCESS_LOCAL, 1420 bytes)
14/10/24 23:09:42 INFO Executor: Running task 0.0 in stage 2.0 (TID 8)
14/10/24 23:09:42 INFO Executor: Running task 1.0 in stage 2.0 (TID 9)
14/10/24 23:09:42 INFO Executor: Running task 3.0 in stage 2.0 (TID 11)
a14/10/24 23:09:42 INFO Executor: Running task 2.0 in stage 2.0 (TID 10)
14/10/24 23:09:42 INFO Executor: Finished task 2.0 in stage 2.0 (TID 10). 585 bytes result sent to driver
114/10/24 23:09:42 INFO TaskSetManager: Finished task 2.0 in stage 2.0 (TID 10) in 16 ms on localhost (1/4)
314/10/24 23:09:42 INFO Executor: Finished task 0.0 in stage 2.0 (TID 8). 585 bytes result sent to driver
214/10/24 23:09:42 INFO TaskSetManager: Finished task 0.0 in stage 2.0 (TID 8) in 16 ms on localhost (2/4)
6
9
14/10/24 23:09:42 INFO Executor: Finished task 1.0 in stage 2.0 (TID 9). 585 bytes result sent to driver
b14/10/24 23:09:42 INFO TaskSetManager: Finished task 1.0 in stage 2.0 (TID 9) in 16 ms on localhost (3/4)
1
2
2
5
9
14/10/24 23:09:42 INFO Executor: Finished task 3.0 in stage 2.0 (TID 11). 585 bytes result sent to driver
14/10/24 23:09:42 INFO TaskSetManager: Finished task 3.0 in stage 2.0 (TID 11) in 31 ms on localhost (4/4)
14/10/24 23:09:42 INFO DAGScheduler: Stage 2 (foreach at <console>:21) finished in 0.031 s
14/10/24 23:09:42 INFO TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/10/24 23:09:42 INFO SparkContext: Job finished: foreach at <console>:21, took 0.037641021 s

Why are the distances not populated?
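(Output printed inside foreach is produced by the executor threads, which is why it is interleaved with the INFO log lines above rather than collected as a result. A sketch of the usual way to inspect an RDD's contents on the driver:)

distAll.collect().foreach(p => p.foreach(println))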

Update:

To get Eugene Zhulenev's answer below to work for me, I had to make the following changes:

extend UserObject with java.io.Serializable

also rename User to UserObject.

Here is the updated code:

val user1 = List("a", "1", "3", "2", "6", "9")  //> user1  : List[String] = List(a, 1, 3, 2, 6, 9)
val user2 = List("b", "1", "2", "2", "5", "9")  //> user2  : List[String] = List(b, 1, 2, 2, 5, 9)

case class User(name: String, features: Vector[Double])

object UserObject extends java.io.Serializable {
  def fromList(list: List[String]): User = list match {
    case h :: tail => User(h, tail.map(_.toDouble).toVector)
  }
}

val all = List(UserObject.fromList(user1), UserObject.fromList(user2))

val users = sc.parallelize(all.combinations(2).toSeq.map {
  case l :: r :: Nil => (l, r)
})

def euclDistance(userA: User, userB: User) = {
  println(s"comparing ${userA.name} and ${userB.name}")
  val subElements = (userA.features zip userB.features) map {
    m => (m._1 - m._2) * (m._1 - m._2)
  }
  val summed = subElements.sum
  val sqRoot = Math.sqrt(summed)

  println("value is " + sqRoot)
  sqRoot
}

users.foreach(t => euclDistance(t._1, t._2))
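(Sketch: to get the distances back on the driver instead of only printing them inside the tasks, one can map each pair to a keyed distance and collect; distances is an illustrative name:)

val distances = users.map { case (a, b) => ((a.name, b.name), euclDistance(a, b)) }
distances.collect().foreach(println)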

Update 2:

I've tried the code in maasg's answer but receive this error:

scala> val userDistanceRdd = usersRdd.map { case (user1, user2) => {
     |         val data = sc.broadcast.value
     |         val distance = euclidDistance(data(user1), data(user2))
     |         ((user1, user2),distance)
     |     }
     |     }
<console>:27: error: missing arguments for method broadcast in class SparkContext;
follow this method with `_' if you want to treat it as a partially applied function
               val data = sc.broadcast.value

Here is the entire code with my amendments:

type UserId = String
type UserData = Array[Double]

val users: List[UserId]= List("a" , "b")
val data: Map[UserId,UserData] = Map( ("a" , Array(3.0,4.0)),
("b" , Array(3.0,4.0)) )

def combinations[T](l: List[T]): List[(T,T)] = l match {
    case Nil => Nil
    case h::Nil => Nil
    case h::t => t.map(x=>(h,x)) ++ combinations(t)
}
val broadcastData = sc.broadcast(data)

val usersRdd = sc.parallelize(combinations(users))

val euclidDistance: (UserData, UserData) => Double = (x,y) => 
    math.sqrt((x zip y).map{case (a,b) => math.pow(a-b,2)}.sum)
val userDistanceRdd = usersRdd.map { case (user1, user2) => {
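        // NOTE: the next line is the one the compiler rejects: broadcast takes
        // the value to broadcast as an argument, and sc cannot be used inside
        // an RDD closure anyway -- it should be broadcastData.value (see below)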
        val data = sc.broadcast.value
        val distance = euclidDistance(data(user1), data(user2))
        ((user1, user2),distance)
    }
    }

For maasg's code to work, I needed to add a closing } to the userDistanceRdd function.

Code:

type UserId = String
type UserData = Array[Double]

val users: List[UserId] = List("a" , "b")

val data: Map[UserId,UserData] = Map( ("a" , Array(3.0,4.0)),
("b" , Array(3.0,3.0)) )

def combinations[T](l: List[T]): List[(T,T)] = l match {
    case Nil => Nil
    case h::Nil => Nil
    case h::t => t.map(x=>(h,x)) ++ combinations(t)
}

val broadcastData = sc.broadcast(data)
val usersRdd = sc.parallelize(combinations(users))
val euclidDistance: (UserData, UserData) => Double = (x,y) => 
    math.sqrt((x zip y).map{case (a,b) => math.pow(a-b,2)}.sum)
val userDistanceRdd = usersRdd.map { case (user1, user2) => {
        val data = broadcastData.value
        val distance = euclidDistance(data(user1), data(user2))
        ((user1, user2), distance)
    }
}

userDistanceRdd.foreach(println)
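With the sample data above there is a single pair, so this should print ((a,b),1.0), since sqrt((3-3)^2 + (4-3)^2) = 1.0; in local mode the line appears amid the executor log output.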

Answer by Eugene Zhulenev:

First of all, I suggest moving your user model from a raw list into a well-typed class. Then note that you don't need to compute the distance between a user and itself, like (a,a) or (b,b), and there is no reason to compute each distance twice, as (a,b) and (b,a).

  import org.apache.spark.rdd.RDD   // needed for the RDD[(User, User)] type below

  val user1 = List("a", "1", "3", "2", "6", "9")
  val user2 = List("b", "1", "2", "2", "5", "9")

  case class User(name: String, features: Vector[Double])

  object User {
    def fromList(list: List[String]): User = list match {
      case h :: tail => User(h, tail.map(_.toDouble).toVector)
    }
  }

  def euclDistance(userA: User, userB: User) = {
    println(s"comparing ${userA.name} and ${userB.name}")
    val subElements = (userA.features zip userB.features) map {
      m => (m._1 - m._2) * (m._1 - m._2)
    }
    val summed = subElements.sum
    val sqRoot = Math.sqrt(summed)

    sqRoot
  }

  val all = List(User.fromList(user1), User.fromList(user2))


  val users: RDD[(User, User)] = sc.parallelize(all.combinations(2).toSeq.map {
    case l :: r :: Nil => (l, r)
  })

  users.foreach(t => euclDistance(t._1, t._2))
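(For illustration, with a hypothetical third user the pair RDD would hold the three unordered combinations, N*(N-1)/2 in general:)

  val user3 = List("c", "2", "2", "2", "5", "8")   // hypothetical extra user
  val all3 = List(user1, user2, user3).map(User.fromList)
  // all3.combinations(2) yields (a,b), (a,c), (b,c): no (x,x) pairs and no
  // duplicated (b,a), which is the point made above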

Answer by maasg:

The actual solution will depend on the dimensions of the dataset. Assuming that the original dataset fits in memory and you want to parallelize the computation of the Euclidean distance, I'd proceed like this:

Assume users is the list of user ids and data is the per-user data to be processed, indexed by id.

// sc is the Spark Context
type UserId = String
type UserData = Array[Double]

val users: List[UserId]= ???
val data: Map[UserId,UserData] = ???
// combinations generates the unique pairs of users for which distance makes sense:
// given that euclidDistance(a,b) = euclidDistance(b,a), only (a,b) is in this set
def combinations[T](l: List[T]): List[(T,T)] = l match {
    case Nil => Nil
    case h::Nil => Nil
    case h::t => t.map(x=>(h,x)) ++ combinations(t)
}

// broadcasts the data to all workers
val broadcastData = sc.broadcast(data)
val usersRdd = sc.parallelize(combinations(users))
val euclidDistance: (UserData, UserData) => Double = (x,y) => 
    math.sqrt((x zip y).map{case (a,b) => math.pow(a-b,2)}.sum)
val userDistanceRdd = usersRdd.map { case (user1, user2) => {
        val data = broadcastData.value
        val distance = euclidDistance(data(user1), data(user2))
        ((user1, user2), distance)
    }
}

If the user data is too large for a broadcast variable, you would instead load it from external storage.
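(A sketch of that variant, assuming the per-user data lives in a pair RDD keyed by UserId; dataRdd is an illustrative name and how it is loaded is left open. Joining twice attaches each side's features before computing the distance:)

val dataRdd: RDD[(UserId, UserData)] = ???  // e.g. parsed from external storage

val userDistanceRdd = usersRdd
  .join(dataRdd)                                   // (u1, (u2, d1))
  .map { case (u1, (u2, d1)) => (u2, (u1, d1)) }   // re-key by the second user
  .join(dataRdd)                                   // (u2, ((u1, d1), d2))
  .map { case (u2, ((u1, d1), d2)) => ((u1, u2), euclidDistance(d1, d2)) }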
