
Add random elements to keyed RDD from the same RDD

Imagine we have a keyed RDD, RDD[(Int, List[String])], with thousands of keys and thousands to millions of values:

val rdd = sc.parallelize(Seq(
  (1, List("a")),
  (2, List("a", "b")),
  (3, List("b", "c", "d")),
  (4, List("f"))))

For each key I need to add random values taken from other keys. The number of elements to add varies and depends on the number of elements already in the key, so the output could look like:

val rdd2: RDD[(Int, List[String])] = sc.parallelize(Seq(
  (1, List("a", "c")),
  (2, List("a", "b", "b", "c")),
  (3, List("b", "c", "d", "a", "a", "f")),
  (4, List("f", "d"))))

I came up with the following solution, which is obviously not very efficient (note: the flatten and aggregation steps are optional; I'm fine with flattened data):

// flatten the input RDD
val rddFlat: RDD[(Int, String)] = rdd.flatMap(x => x._2.map(s => (x._1, s)))
// calculate number of elements for each key
val count = rddFlat.countByKey().toSeq
// foreach key take samples from the input RDD, change the original key and union all RDDs
val rddRandom: RDD[(Int, String)] = count.map { x =>
  (x._1, rddFlat.sample(withReplacement = true, x._2.toDouble / count.map(_._2).sum, scala.util.Random.nextLong()))
}.map(x => x._2.map(t => (x._1, t._2))).reduce(_.union(_))

// union the input RDD with the random RDD and aggregate
val rddWithRandomData: RDD[(Int, List[String])] = rddFlat
    .union(rddRandom)
    .aggregateByKey(List[String]())(_ :+ _, _ ++ _)

What's the most efficient and elegant way to achieve that?
I use Spark 1.4.1.

Looking at the current approach, and in order to ensure the scalability of the solution, the area of focus should probably be a sampling mechanism that can run in a distributed fashion, removing the need to collect the keys back to the driver.

In a nutshell, we need a distributed method to take a weighted sample of all the values.

What I propose is to create a matrix of keys × values where each cell holds the probability of that value being chosen for that key. Then we can randomly score the matrix and keep the values whose score falls within their probability.
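Before the Spark version, here is a minimal local sketch of the same idea in plain Scala, on hypothetical toy data (the names `keyProb`, `allValues`, and the fixed seed are illustrative, not part of the original solution): each (key, value) cell is kept when a random score falls below the key's probability.

```scala
import scala.util.Random

// Hypothetical toy data, standing in for the flattened RDD
val data = Map(1 -> List("a"), 2 -> List("a", "b"), 3 -> List("b", "c", "d"))

val total   = data.values.map(_.size).sum.toDouble            // 6 values overall
val keyProb = data.map { case (k, vs) => k -> vs.size / total } // key -> weight

val allValues = data.values.flatten.toList
val rng = new Random(42)                                       // fixed seed for repeatability

// "Score" every (key, value) cell and keep those whose score is under the key's probability
val sampled = for {
  (k, p) <- keyProb.toList
  v      <- allValues
  if rng.nextDouble() < p
} yield (k, v)
```

The Spark algorithm below is the distributed analogue: `cartesian` materializes the key × value cells, and a filter on the random score does the sampling.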

Let's write a Spark-based algorithm for that:

// Sample data to guide us.
// Note that I'm using distinguishable data across keys to see how the sampled data distributes over the keys.
val data = sc.parallelize(Seq(
  (1, List("A", "B")),
  (2, List("x", "y", "z")),
  (3, List("1", "2", "3", "4")),
  (4, List("foo", "bar")),
  (5, List("+")),
  (6, List())))

import scala.util.Random

val flattenedData = data.flatMap { case (k, vlist) => vlist.map(v => (k, v)) }
val values = data.flatMap { case (k, list) => list }
val keysBySize = data.map { case (k, list) => (k, list.size) }
val totalElements = keysBySize.map { case (k, size) => size }.sum
val keysByProb = keysBySize.mapValues { size => size.toDouble / totalElements }
val probMatrix = keysByProb.cartesian(values)
val scoredSamples = probMatrix.map { case ((key, prob), value) =>
  ((key, value), (prob, Random.nextDouble))
}

scoredSamples looks like this:

((1,A),(0.16666666666666666,0.911900315814998))
((1,B),(0.16666666666666666,0.13615047422122906))
((1,x),(0.16666666666666666,0.6292430257377151))
((1,y),(0.16666666666666666,0.23839887096373114))
((1,z),(0.16666666666666666,0.9174808344986465))

...

val samples = scoredSamples.collect { case (entry, (prob, score)) if score < prob => entry }

samples looks like this:

(1,foo)
(1,bar)
(2,1)
(2,3)
(3,y)
...

Now we union the sampled data with the original data to obtain the final result.

val result = (flattenedData union samples).groupByKey.mapValues(_.toList)

result.collect()
(1,List(A, B, B))
(2,List(x, y, z, B))
(3,List(1, 2, 3, 4, z, 1))
(4,List(foo, bar, B, 2))
(5,List(+, z))

Given that the whole algorithm is written as a sequence of transformations on the original data (see the DAG below), with minimal shuffling (only the final groupByKey, which operates on a small result set), it should be scalable. The only limitation is the list of values per key in the groupByKey stage, which is only there to match the representation used in the question.
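As a sanity check on how the sampled data distributes over keys: with per-cell probability size(k)/total and one cell per value (so `total` cells per key), the expected number of extra values for key k works out to size(k) itself, matching the requirement that keys with more elements receive more samples. A quick local calculation in plain Scala, using hypothetical key sizes mirroring the sample data above:

```scala
// Hypothetical key sizes, mirroring the sample data above
val sizes = Map(1 -> 2, 2 -> 3, 3 -> 4, 4 -> 2, 5 -> 1, 6 -> 0)
val total = sizes.values.sum                      // 12 values overall

// For key k: probability per cell = size(k)/total, cells per key = total,
// so expected extras = (size(k)/total) * total = size(k)
val expectedExtras = sizes.map { case (k, s) =>
  k -> (s.toDouble / total) * total
}
```

So on average each key roughly doubles in size, which is consistent with the example outputs shown earlier.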

[Image: DAG of the computation]
