How to find median and quantiles using Spark

How can I find the median of an RDD of integers using a distributed method, IPython, and Spark? The RDD has approximately 700,000 elements and is therefore too large to collect and find the median locally.

This question is similar to the following question; however, that answer uses Scala, which I do not know.

How can I calculate exact median with Apache Spark?

Using the reasoning from the Scala answer, I am trying to write a similar answer in Python.

I know I first want to sort the RDD, but I do not know how. I see the sortBy (sorts this RDD by the given keyfunc) and sortByKey (sorts this RDD, which is assumed to consist of (key, value) pairs) methods. I think both use key-value pairs, and my RDD only has integer elements.

  1. First, I was thinking of doing myrdd.sortBy(lambda x: x)?
  2. Next I will find the length of the RDD (rdd.count()).
  3. Finally, I want to find the element, or the two elements, at the center of the RDD. I need help with this step too.

EDIT:

I had an idea. Maybe I can index my RDD and then use key = index and value = element. Then I can try to sort by value? I don't know if this is possible because there is only a sortByKey method.
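
(A minimal sketch of this idea, assuming myrdd is a plain integer RDD: sortBy works directly on the values, and zipWithIndex then attaches a positional index to each element.)

# sketch of the edit above: sort the plain values, then attach positions
indexed = (myrdd
    .sortBy(lambda x: x)              # sortBy takes a keyfunc, so no (key, value) pairs are needed
    .zipWithIndex()                   # -> (value, index) pairs
    .map(lambda vi: (vi[1], vi[0])))  # flip to (index, value) for lookup by position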

Ongoing work

SPARK-30569 - Add DSL functions invoking percentile_approx

Spark 2.0+:

You can use the approxQuantile method, which implements the Greenwald-Khanna algorithm:

Python:

df.approxQuantile("x", [0.5], 0.25)

Scala:

df.stat.approxQuantile("x", Array(0.5), 0.25)

where the last parameter is the relative error. The lower the number, the more accurate the result and the more expensive the computation.
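
For instance (a hedged sketch; per the approxQuantile documentation, a relative error of 0.0 requests the exact quantile at the cost of a much more expensive computation):

# relativeError = 0.0 computes the exact quantile, but is expensive on large data
exact_median = df.approxQuantile("x", [0.5], 0.0)

# relativeError = 0.25 trades accuracy for a cheaper computation
rough_median = df.approxQuantile("x", [0.5], 0.25)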

Since Spark 2.2 (SPARK-14352) it supports estimation on multiple columns:

df.approxQuantile(["x", "y", "z"], [0.5], 0.25)

and

df.approxQuantile(Array("x", "y", "z"), Array(0.5), 0.25)

The underlying method can also be used in SQL aggregations (both global and grouped) using the approx_percentile function:

> SELECT approx_percentile(10.0, array(0.5, 0.4, 0.1), 100);
 [10.0,10.0,10.0]
> SELECT approx_percentile(10.0, 0.5, 100);
 10.0
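
For a grouped aggregation from PySpark, a minimal sketch (the group column "key" and column "x" are illustrative; this assumes a Spark version where approx_percentile is available, as shown above) could use expr:

import pyspark.sql.functions as F

# per-group median via approx_percentile
df.groupBy("key").agg(F.expr("approx_percentile(x, 0.5, 100)").alias("median_x"))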

Spark < 2.0

Python

As I've mentioned in the comments, it is most likely not worth all the fuss. If the data is relatively small, as in your case, then simply collect and compute the median locally:

import numpy as np

np.random.seed(323)
rdd = sc.parallelize(np.random.randint(1000000, size=700000))

# %time is an IPython magic; it reports how long the local median takes
%time np.median(rdd.collect())

# rough memory footprint of the collected data
np.array(rdd.collect()).nbytes

It takes around 0.01 seconds on my few-years-old computer and around 5.5 MB of memory.

If the data is much larger, sorting will be a limiting factor, so instead of getting an exact value it is probably better to sample, collect, and compute locally. But if you really want to use Spark, something like this should do the trick (if I didn't mess anything up):

from numpy import floor
import numpy as np
import time

def quantile(rdd, p, sample=None, seed=None):
    """Compute a quantile of order p ∈ [0, 1].

    :param rdd: a numeric RDD
    :param p: quantile (between 0 and 1)
    :param sample: fraction of the RDD to use; if not provided the whole dataset is used
    :param seed: random number generator seed to be used with sample
    """
    assert 0 <= p <= 1
    assert sample is None or 0 < sample <= 1

    seed = seed if seed is not None else time.time()
    rdd = rdd if sample is None else rdd.sample(False, sample, seed)

    rddSortedWithIndex = (rdd
        .sortBy(lambda x: x)
        .zipWithIndex()
        .map(lambda xi: (xi[1], xi[0]))  # (index, value) so we can look up by position
        .cache())

    n = rddSortedWithIndex.count()
    h = (n - 1) * p

    # values at the two positions surrounding h, used for linear interpolation
    rddX, rddXPlusOne = (
        rddSortedWithIndex.lookup(x)[0]
        for x in int(floor(h)) + np.array([0, 1]))

    return rddX + (h - floor(h)) * (rddXPlusOne - rddX)

And some tests:

np.median(rdd.collect()), quantile(rdd, 0.5)
## (500184.5, 500184.5)
np.percentile(rdd.collect(), 25), quantile(rdd, 0.25)
## (250506.75, 250506.75)
np.percentile(rdd.collect(), 75), quantile(rdd, 0.75)
## (750069.25, 750069.25)

Finally, let's define the median:

from functools import partial
median = partial(quantile, p=0.5)
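
A quick usage sketch, assuming the rdd from the tests above:

median(rdd)
## 500184.5, matching np.median(rdd.collect()) above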

So far so good, but it takes 4.66 s in local mode without any network communication. There is probably a way to improve this, but why even bother?

Language independent (Hive UDAF):

If you use HiveContext you can also use Hive UDAFs. With integral values:

rdd.map(lambda x: (float(x), )).toDF(["x"]).registerTempTable("df")

sqlContext.sql("SELECT percentile_approx(x, 0.5) FROM df")

With continuous values:

sqlContext.sql("SELECT percentile(x, 0.5) FROM df")

In percentile_approx you can pass an additional argument which determines the number of records to use.
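
For example, a hedged sketch passing that extra argument (10000 here is an arbitrary value; higher values trade memory for a better approximation):

# the third argument to percentile_approx controls the approximation accuracy
sqlContext.sql("SELECT percentile_approx(x, 0.5, 10000) FROM df")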

Here is the method I used, using window functions (with PySpark 2.2.0).

from pyspark.sql import DataFrame

class median():
    """ Create median class with over method to pass partition """
    def __init__(self, df, col, name):
        assert col
        self.column=col
        self.df = df
        self.name = name

    def over(self, window):
        from pyspark.sql.functions import percent_rank, pow, first

        first_window = window.orderBy(self.column)                                  # first, order by column we want to compute the median for
        df = self.df.withColumn("percent_rank", percent_rank().over(first_window))  # add percent_rank column, percent_rank = 0.5 corresponds to median
        second_window = window.orderBy(pow(df.percent_rank-0.5, 2))                 # order by (percent_rank - 0.5)^2 ascending
        return df.withColumn(self.name, first(self.column).over(second_window))     # the first row of the window corresponds to median

def addMedian(self, col, median_name):
    """ Method to be added to spark native DataFrame class """
    return median(self, col, median_name)

# Add method to DataFrame class
DataFrame.addMedian = addMedian

Then call the addMedian method to calculate the median of col2:

from pyspark.sql import Window

median_window = Window.partitionBy("col1")
df = df.addMedian("col2", "median").over(median_window)

Finally, you can group by if needed.

df.groupby("col1", "median")

Adding a solution if you want an RDD-only method and don't want to move to DataFrames. This snippet can get you a percentile for an RDD of doubles.

If you pass 50 as the percentile, you should obtain your required median. Let me know if there are any corner cases not accounted for.

/**
  * Gets the nth percentile entry for an RDD of doubles
  *
  * @param inputScore : Input scores consisting of an RDD of doubles
  * @param percentile : The percentile cutoff required (between 0 and 100), e.g. the 90th percentile of [1,4,5,9,19,23,44] = ~23.
  *                     It prefers the higher value when the desired quantile lies between two data points
  * @return : The number best representing the percentile in the RDD of doubles
  */
  def getRddPercentile(inputScore: RDD[Double], percentile: Double): Double = {
    val numEntries = inputScore.count().toDouble
    // clamp the target position to the valid index range [0, numEntries - 1]
    val retrievedEntry = (percentile * numEntries / 100.0).min(numEntries - 1).max(0).toInt

    inputScore
      .sortBy { case (score) => score }
      .zipWithIndex()
      .filter { case (score, index) => index == retrievedEntry }
      .map { case (score, index) => score }
      .collect()(0)
  }

I have written a function which takes a data frame as input and returns a data frame containing the median over a partition, where order_col is the column for which we want to calculate the median and part_col is the level (grouping) at which we want to calculate it:

from pyspark.sql import Window
import pyspark.sql.functions as F

def calculate_median(dataframe, part_col, order_col):
    win = Window.partitionBy(*part_col).orderBy(order_col)
#     count_row = dataframe.groupby(*part_col).distinct().count()
    dataframe.persist()
    dataframe.count()
    temp = dataframe.withColumn("rank", F.row_number().over(win))
    temp = temp.withColumn(
        "count_row_part",
        F.count(order_col).over(Window.partitionBy(part_col))
    )
    temp = temp.withColumn(
        "even_flag",
        F.when(
            F.col("count_row_part") %2 == 0,
            F.lit(1)
        ).otherwise(
            F.lit(0)
        )
    ).withColumn(
        "mid_value",
        F.floor(F.col("count_row_part")/2)
    )

    # flag the middle row(s): for an even count the two middle ranks,
    # for an odd count the single middle rank (rank == mid_value + 1)
    temp = temp.withColumn(
        "avg_flag",
        F.when(
            ((F.col("even_flag") == 1) &
             (F.col("rank") == F.col("mid_value"))) |
            ((F.col("rank") - 1) == F.col("mid_value")),
            F.lit(1)
        ).otherwise(
            F.when(
                F.col("rank") == F.col("mid_value") + 1,
                F.lit(1)
            )
        )
    )
    temp.show(10)
    return temp.filter(
        F.col("avg_flag") == 1
    ).groupby(
        part_col + ["avg_flag"]
    ).agg(
        F.avg(F.col(order_col)).alias("median")
    ).drop("avg_flag")

There are two ways that can be used: the approxQuantile method and the percentile_approx function. However, both methods might not give accurate results when there is an even number of records.

import pyspark.sql.functions as F

# might not give proper results when there is an even number of records
# df.select(F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5).alias("MEDIAN"))

df.select(
    ((F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.5)
      + F.percentile_approx("COLUMN_NAME_FOR_WHICH_MEDIAN_TO_BE_COMPUTED", 0.51)) * 0.5
    ).alias("MEDIAN")
)

For exact median computation you can use the following function and use it with the PySpark DataFrame API:

from typing import Union

import pyspark.sql.functions as F
from pyspark.sql import Column


def median_exact(col: Union[Column, str]) -> Column:
    """
    For grouped aggregations, Spark provides a way via pyspark.sql.functions.percentile_approx("col", .5) function,
    since for large datasets, computing the median is computationally expensive.
    This function manually computes the median and should only be used for small to mid sized datasets / groupings.
    :param col: Column to compute the median for.
    :return: A pyspark `Column` containing the median calculation expression
    """
    list_expr = F.filter(F.collect_list(col), lambda x: x.isNotNull())
    sorted_list_expr = F.sort_array(list_expr)
    size_expr = F.size(sorted_list_expr)

    even_num_elements = (size_expr % 2) == 0
    odd_num_elements = ~even_num_elements

    return F.when(size_expr == 0, None).otherwise(
        F.when(odd_num_elements, sorted_list_expr[F.floor(size_expr / 2)]).otherwise(
            (
                sorted_list_expr[(size_expr / 2 - 1).cast("long")]
                + sorted_list_expr[(size_expr / 2).cast("long")]
            )
            / 2
        )
    )

Apply it like this:

output_df = input_spark_df.groupby("group").agg(
    median_exact("elems").alias("elems_median")
)

We can calculate the median and quantiles in Spark using df.stat.approxQuantile(col, [quantiles], error).

For example, finding the median in this data frame: [1, 2, 3, 4, 5]

df.stat.approxQuantile(col, [0.5], 0)

The lower the error, the more accurate the results.

From version 3.4+ (and already in 3.3.1) the median function is directly available: https://github.com/apache/spark/blob/e170a2eb236a376b036730b5d63371e753f1d947/python/pyspark/sql/functions.py#L633

import pyspark.sql.functions as f

df.groupBy("grp").agg(f.median("val"))

I guess the respective documentation will be added once the version is finally released.
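
An ungrouped aggregation should work the same way (a sketch, assuming Spark 3.4+ or 3.3.1 as noted above):

import pyspark.sql.functions as f

# global median over the whole DataFrame
df.select(f.median("val").alias("median_val"))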
