How to find the median in Apache Spark with Python Dataframe API?
PySpark API provides many aggregate functions, but not the median. Spark 2 comes with approxQuantile, which gives approximate quantiles, but an exact median is very expensive to calculate. Is there a more idiomatic PySpark way of calculating the median for a column of values in a Spark Dataframe?
Here is an example implementation with the Dataframe API in Python (Spark 1.6+).
import pyspark.sql.functions as F
import numpy as np
from pyspark.sql.types import FloatType
Let's assume we have monthly salaries for customers in a Spark dataframe named "salaries", with columns such as:
month | customer_id | salary
and we would like to find the median salary per customer across all the months.
Step 1: Write a user-defined function to calculate the median
def find_median(values_list):
    try:
        median = np.median(values_list)  # median of the values collected in each row's list
        return round(float(median), 2)
    except Exception:
        return None  # if there is anything wrong with the given values

median_finder = F.udf(find_median, FloatType())
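The helper can be checked on plain Python lists before it is wrapped in a UDF. Below is a minimal sketch of a dependency-free variant that swaps np.median for the standard library's statistics.median; the name find_median_stdlib is hypothetical and not part of the original answer. One reason to consider it: as far as I know, np.median returns nan for an empty list instead of raising, so the try/except above would not turn that case into None, whereas statistics.median raises StatisticsError.

```python
import statistics


def find_median_stdlib(values_list):
    """Return the median rounded to 2 decimals, or None if it cannot be computed."""
    try:
        # Drop missing entries; statistics.median raises StatisticsError on an empty sequence
        cleaned = [v for v in values_list if v is not None]
        return round(float(statistics.median(cleaned)), 2)
    except (statistics.StatisticsError, TypeError, ValueError):
        return None


print(find_median_stdlib([1000.0, 3000.0, 2000.0]))  # 2000.0
print(find_median_stdlib([10, 20]))                  # 15.0
print(find_median_stdlib([]))                        # None
```

If this behaviour is preferred, the function could be registered the same way, with F.udf(find_median_stdlib, FloatType()).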
Step 2: Aggregate on the salary column by collecting each customer's salaries into a list in each row:
salaries_list = salaries.groupBy("customer_id").agg(F.collect_list("salary").alias("salaries"))
Step 3: Call the median_finder udf on the salaries column and add the median values as a new column:
salaries_list = salaries_list.withColumn("median",median_finder("salaries"))
For an exact median (for small to mid-sized dataframes), since Spark 2.1 one can use the percentile function wrapped in expr:
F.expr('percentile(c2, 0.5)')
df = spark.createDataFrame(
    [(1, 10),
     (1, 20),
     (2, 50)],
    ['c1', 'c2'])
df.groupby('c1').agg(F.expr('percentile(c2, 0.5)').alias('median')).show()
# +---+------+
# | c1|median|
# +---+------+
# | 1| 15.0|
# | 2| 50.0|
# +---+------+
from pyspark.sql import Window as W  # W is used below but was never imported

df.withColumn('median', F.expr('percentile(c2, 0.5)').over(W.partitionBy('c1'))).show()
# +---+---+------+
# | c1| c2|median|
# +---+---+------+
# | 1| 10| 15.0|
# | 1| 20| 15.0|
# | 2| 50| 50.0|
# +---+---+------+
Approximate median can often be a better choice for mid to large-sized dataframes.
Spark 2.1 implements approx_percentile and percentile_approx:
F.expr('percentile_approx(c2, 0.5)')
Since Spark 3.1 one can use it in the PySpark API directly:
F.percentile_approx('c2', 0.5)
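To tie the approximate options together, here is a hedged sketch of two small wrappers. The helper names are hypothetical. The first assumes Spark 3.1+ and passes the optional accuracy argument of F.percentile_approx (its documented default is 10000). The second uses DataFrame.approxQuantile, the method mentioned in the question, for a whole-column median; the relative_error default of 0.01 is an arbitrary choice for illustration.

```python
def approx_median_per_group(df, group_col, value_col, accuracy=10000):
    """Approximate median of value_col for each group_col value (assumes Spark 3.1+)."""
    # Imported lazily so this helper can be defined without PySpark installed
    from pyspark.sql import functions as F

    # A larger accuracy lowers the approximation error at the cost of more memory
    return df.groupBy(group_col).agg(
        F.percentile_approx(value_col, 0.5, accuracy).alias("median")
    )


def approx_median(df, column, relative_error=0.01):
    """Approximate median of a whole column via DataFrame.approxQuantile."""
    # approxQuantile returns one value per requested probability
    quantiles = df.approxQuantile(column, [0.5], relative_error)
    return quantiles[0] if quantiles else None


# Usage with the df defined earlier (requires an active SparkSession):
# approx_median_per_group(df, "c1", "c2").show()
# approx_median(df, "c2")
```

Note that approxQuantile returns a Python list to the driver instead of a new column, so it suits a single overall median better than a per-group one.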