
Sample RDD element(s) according to weighted probability [Spark]

In PySpark I have an RDD composed of (key; value) pairs where keys are sequential integers and values are floating point numbers.

I'd like to sample exactly one element from this RDD with probability proportional to value.

In a naive manner, this task can be accomplished as follows:

import numpy

pairs = myRDD.collect()   #now pairs is a list of (key;value) tuples
K, V = zip(*pairs)        #separate keys and values
V = numpy.array(V)/sum(V) #normalise probabilities
extractedK = numpy.random.choice(K,size=1,replace=True, p=V)

What concerns me is the collect() operation which, as you might know, loads the entire list of tuples into memory, which can be quite expensive. I'm aware of takeSample(), which is great when elements should be extracted uniformly, but what happens if elements should be extracted according to weighted probabilities?

Thanks!

Here is an algorithm I worked out to do this:

EXAMPLE PROBLEM

Assume we want to sample 10 items from an RDD on 3 partitions like this:

  • P1: ("A", 0.10), ("B", 0.10), ("C", 0.20)
  • P2: ("D", 0.25), ("E", 0.25)
  • P3: ("F", 0.10)
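
As an aside (my own sketch, not part of the original answer): one way to reproduce exactly this partition layout in PySpark, assuming a live SparkContext sc as in the pyspark shell, is to parallelize one list per partition and then flatten:

# one inner list per desired partition; parallelize with 3 slices puts each
# inner list on its own partition, and flatMap (a narrow transformation)
# keeps that partitioning
parts = [[("A", 0.10), ("B", 0.10), ("C", 0.20)],
         [("D", 0.25), ("E", 0.25)],
         [("F", 0.10)]]
rdd = sc.parallelize(parts, 3).flatMap(lambda p: p)
print(rdd.glom().collect())   # shows the three partitions listed above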

Here is the high-level algorithm:

INPUT: number of samples and an RDD of items (with weights)

OUTPUT: dataset sample on driver

  1. For each partition, calculate the total probability of sampling from the partition, and aggregate those values to the driver.
    • This would give the probability distribution: Prob(P1) = 0.40, Prob(P2) = 0.50, Prob(P3) = 0.10
  2. Generate a sample of the partitions (to determine the number of elements to select from each partition; see the small sketch after this list).
    • A sample may look like this: [P1, P1, P1, P1, P2, P2, P2, P2, P2, P3]
    • This would give us 4 items from P1, 5 items from P2, and 1 item from P3.
  3. On each separate partition, we locally generate a sample of the needed size using only the elements on that partition:
    • On P1, we would sample 4 items with the (re-normalized) probability distribution: Prob(A) = 0.25, Prob(B) = 0.25, Prob(C) = 0.50. This could yield a sample such as [A, B, C, C].
    • On P2, we would sample 5 items with probability distribution: Prob(D) = 0.5, Prob(E) = 0.5. This could yield a sample such as [D, D, E, E, E].
    • On P3: sample 1 item with probability distribution: P(F) = 1.0. This would generate the sample [F].
  4. Collect the samples to the driver to yield your dataset sample [A, B, C, C, D, D, E, E, E, F].
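
As a quick illustration of step 2 (my own sketch, not part of the original answer), the partition-level draw is just a weighted choice over partition indexes:

import numpy as np

partitions = ["P1", "P2", "P3"]
probs = [0.40, 0.50, 0.10]   # per-partition totals from step 1
draw = np.random.choice(partitions, size=10, replace=True, p=probs)
# e.g. ['P1' 'P1' 'P1' 'P1' 'P2' 'P2' 'P2' 'P2' 'P2' 'P3']
# counting occurrences gives the number of samples to take from each partition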

Here is an implementation in Scala:

  import scala.reflect.ClassTag

  import org.apache.spark.SparkContext
  import org.apache.spark.rdd.RDD

  case class Sample[T](weight: Double, obj: T)

  /*
   *  Obtain a sample of size `numSamples` from an RDD `ar` using a two-phase distributed sampling approach.
   */
  def sampleWeightedRDD[T:ClassTag](ar: RDD[Sample[T]], numSamples: Int)(implicit sc: SparkContext): Array[T] = {
    // 1. Get total weight on each partition
    var partitionWeights = ar.mapPartitionsWithIndex{case(partitionIndex, iter) => Array((partitionIndex, iter.map(_.weight).sum)).toIterator }.collect().toArray

    //Normalize to 1.0
    val Z = partitionWeights.map(_._2).sum
    partitionWeights = partitionWeights.map{case(partitionIndex, weight) => (partitionIndex, weight/Z)}

    // 2. Sample from partitions indexes to determine number of samples from each partition
    val samplesPerIndex = sc.broadcast(sample[Int](partitionWeights, numSamples).groupBy(x => x).mapValues(_.size).toMap).value

    // 3. On each partition, sample the number of elements needed for that partition
    ar.mapPartitionsWithIndex{case(partitionIndex, iter) => 
      val numSamplesForPartition = samplesPerIndex.getOrElse(partitionIndex, 0)
      var ar = iter.map(x => (x.obj, x.weight)).toArray

      //Normalize to 1.0
      val Z = ar.map(x => x._2).sum
      ar = ar.map{case(obj, weight) => (obj, weight/Z)}
      sample(ar, numSamplesForPartition).toIterator
    }.collect()
  }

This code uses a simple weighted sampling function, sample:

  // a very simple weighted sampling function (inverse-CDF sampling over a
  // distribution assumed to be normalized to 1.0)
  def sample[T:ClassTag](dist: Array[(T, Double)], numSamples: Int): Array[T] = {

    // sort by descending probability and build the cumulative distribution
    val probs = dist.zipWithIndex.map{case((elem,prob),idx) => (elem,prob,idx+1)}.sortBy(-_._2)
    val cumulativeDist = probs.map(_._2).scanLeft(0.0)(_+_).drop(1)
    // draw uniform values and map each to the first element whose cumulative
    // probability is at least that value
    (1 to numSamples).toArray.map(x => scala.util.Random.nextDouble).map{case(p) => 

      def findElem(p: Double, cumulativeDist: Array[Double]): Int = {
        for(i <- (0 until cumulativeDist.size-1)) 
          if (p <= cumulativeDist(i)) return i
        return cumulativeDist.size-1
      }

      probs(findElem(p, cumulativeDist))._1
    }
  }
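
Since the original question is about PySpark, here is a rough Python sketch of the same two-phase idea (my own illustration, untested at scale; the name sample_weighted_rdd is made up, and the RDD is assumed to hold (key, weight) pairs as in the question):

import numpy as np

def sample_weighted_rdd(rdd, num_samples):
    # 1. total weight per partition, collected to the driver
    part_weights = rdd.mapPartitionsWithIndex(
        lambda idx, it: [(idx, sum(w for _, w in it))]).collect()
    idxs, weights = zip(*part_weights)
    probs = np.array(weights, dtype=float)
    probs = probs / probs.sum()

    # 2. decide how many samples to draw from each partition
    chosen = np.random.choice(idxs, size=num_samples, replace=True, p=probs)
    per_partition = {int(i): int((chosen == i).sum()) for i in idxs}

    # 3. on each partition, sample locally with re-normalized weights
    def sample_partition(idx, it):
        n = per_partition.get(idx, 0)
        if n == 0:
            return iter([])
        keys, ws = zip(*list(it))
        p = np.array(ws, dtype=float)
        return iter(np.random.choice(keys, size=n, replace=True, p=p / p.sum()).tolist())

    # 4. collect the sampled elements back to the driver
    return rdd.mapPartitionsWithIndex(sample_partition).collect()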

This is basically doable, but you should really consider whether it makes sense to use Spark for this. If you need to draw random values, then you presumably need to do so many times over in a loop. Each iteration will require scanning through all the data (maybe more than once).

So fitting the data you need into memory and then drawing values from it randomly is almost certainly the right way to go. If your data is really too big to fit into memory, consider (a) only collecting the columns you need for this purpose and (b) whether your data can be binned in a way that makes sense.
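
For example, a minimal sketch of option (a), assuming the data sits in a DataFrame with a key column ky and a weight column vl as in the example below (my own illustration):

import numpy as np

# collect only the two columns needed for sampling
pairs = df.select("ky", "vl").collect()
keys = [r["ky"] for r in pairs]
w = np.array([r["vl"] for r in pairs], dtype=float)

# a single weighted draw on the driver
chosen_key = np.random.choice(keys, size=1, replace=True, p=w / w.sum())[0]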

Having said that, it is doable within Spark. Below is PySpark code to demonstrate the idea.

import random
import pyspark.sql.functions as F
from pyspark.sql.window import Window
# read some sample data (shown below)
df = spark.read.csv("prb.csv",sep='\t',inferSchema=True,header=True)
# find the sum of the value column
ss = df.groupBy().agg( F.sum("vl").alias("sum") ).collect()
# add a column to store the normalized values
q = df.withColumn("nrm_vl", (df["vl"] / ss[0].sum) )
w = Window.partitionBy().orderBy("nrm_vl")\
          .rowsBetween(Window.unboundedPreceding, Window.currentRow)
q = q.select("*", F.sum("nrm_vl").over(w).alias("cum_vl"))
q.show()
+---+---+-------------------+-------------------+
| ky| vl|             nrm_vl|             cum_vl|
+---+---+-------------------+-------------------+
|  2|0.8|0.07079646017699115|0.07079646017699115|
|  3|1.1|0.09734513274336283|0.16814159292035397|
|  4|1.7|0.15044247787610618| 0.3185840707964601|
|  0|3.2| 0.2831858407079646| 0.6017699115044247|
|  1|4.5| 0.3982300884955752| 0.9999999999999999|
+---+---+-------------------+-------------------+

def getRandVl(q):
    # choose a random number and find the row whose cum_vl is the
    # smallest value greater than that random number
    # (analogous to `std::upper_bound` in C++)
    chvl = q.where( q["cum_vl"] > random.random() ).groupBy().agg(
        F.min(q["cum_vl"]).alias("cum_vl") )
    return q.join(chvl, on="cum_vl", how="inner")
# seed the result with one sample, then add 30 more random
# samples in a loop.. this is already slow on a single machine.
cdf = getRandVl(q)
for i in range(0,30):
    x = getRandVl(q)
    # add this row. there's no reason to do this (it's slow)
    # except that it's convenient to count how often each
    # key was chosen, to check if this method works
    cdf = cdf.select(cdf.columns).union(x.select(cdf.columns))

# count how often we picked each key
cdf.groupBy("ky","vl").agg( F.count("*").alias("count") ).show()
+---+---+-----+                                                                 
| ky| vl|count|
+---+---+-----+
|  4|1.7|    4|
|  2|0.8|    1|
|  3|1.1|    3|
|  0|3.2|   11|
|  1|4.5|   12|
+---+---+-----+

I think these counts are reasonable given the values. I'd rather test it with many more samples, but it's too slow.
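
As a rough sanity check (my own addition): with n draws, the expected count for each key is n * vl / sum(vl), and the counts shown above total 31:

vals = {0: 3.2, 1: 4.5, 2: 0.8, 3: 1.1, 4: 1.7}
n_draws = 31                  # total of the counts shown above
total = sum(vals.values())    # 11.3
for ky, vl in sorted(vals.items()):
    print(f"ky={ky}: expected ~{n_draws * vl / total:.1f}")
# ky=0: expected ~8.8   (observed 11)
# ky=1: expected ~12.3  (observed 12)
# ky=2: expected ~2.2   (observed 1)
# ky=3: expected ~3.0   (observed 3)
# ky=4: expected ~4.7   (observed 4)

The observed counts are all within a plausible range of these expectations for so few draws.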
