How to calculate the minimum, maximum, and average values for each column in a dataset using MapReduce in PySpark?
Assuming I have a large dataset, here is an abbreviated part of it:
Status,column1,column2,column3,column4
Healthy,4.5044,0.7443,6.34,1.9052
Patient,4.4284,0.9073,5.6433,1.6232
Patient,4.5291,1.0199,6.113,1.0565
Healthy,5.2258,0.6125,7.9504,0.1547
Healthy,4.8834,0.5786,5.6021,0.5942
Patient,5.7422,0.8862,5.1013,0.9402
Healthy,6.5076,0.5438,7.153,0.6711
I know the easiest way to do this is to use df.describe().show() in PySpark, but how can I use MapReduce in PySpark to calculate the minimum, maximum, and average of each column?
df.describe() uses native Spark functions to run the computation. You can explicitly use select expressions to get the same results.
from pyspark.sql import functions as F
data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))
# build one min/max/avg expression per column
select_expr = []
for c in df.columns:
    for metric in [F.min, F.max, F.avg]:
        select_expr.append(metric(c))
df.select(*select_expr).show()
"""
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(Status)|max(Status)|avg(Status)|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
| Healthy| Patient| null| 4.4284| 6.5076|5.117271428571429| 0.5438| 1.0199|0.7560857142857141| 5.1013| 7.9504|6.271871428571429| 0.1547| 1.9052|0.9921571428571428|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
"""
It helps if you specify the output you want in your question, or what you'll be using the output for, but the below should cover most use cases.
from pyspark.sql import functions as F
data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))
# aggregate every column except the label column
columns = df.columns
columns.remove('Status')
cols_to_agg = [f(c) for c in columns for f in [F.min, F.max, F.avg]]
df.agg(*cols_to_agg).show()
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
| 4.4284| 6.5076|5.117271428571429| 0.5438| 1.0199|0.7560857142857144| 5.1013| 7.9504|6.271871428571428| 0.1547| 1.9052|0.9921571428571428|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
Or, as Igor mentioned, you could do a groupBy to get a more granular breakdown:
df.groupBy('status').agg(*cols_to_agg)
Or, if you want both, use a rollup, as this will give the result of both of the above in a single aggregation and output:
df.rollup('status').agg(*cols_to_agg+[F.grouping_id()]).show()
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
| status|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|grouping_id()|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
|Patient| 4.4284| 5.7422| 4.8999| 0.8862| 1.0199|0.9378000000000001| 5.1013| 6.113|5.619199999999999| 0.9402| 1.6232|1.2066333333333332| 0|
|Healthy| 4.5044| 6.5076|5.2802999999999995| 0.5438| 0.7443| 0.6198| 5.6021| 7.9504| 6.761375| 0.1547| 1.9052|0.8312999999999999| 0|
| null| 4.4284| 6.5076| 5.117271428571429| 0.5438| 1.0199|0.7560857142857144| 5.1013| 7.9504|6.271871428571428| 0.1547| 1.9052|0.9921571428571428| 1|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
Note that df.agg is just an alias for df.groupBy().agg(), and grouping_id is the null-safe way to indicate which aggregation level a row belongs to.
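For example, the overall row can be separated from the per-status rows by filtering on it (the gid alias here is just for illustration):

rolled = df.rollup('status').agg(*cols_to_agg + [F.grouping_id().alias('gid')])
rolled.where('gid = 0').show()  # per-status rows only
rolled.where('gid = 1').show()  # overall row only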
Lastly, you could consider putting the outputs in a friendlier format, such as a map column, which outputs the below:
# one map column per input column, keyed by metric name;
# F.lit() makes the metric name a literal key rather than a column reference
cols_to_agg = [
    F.create_map(*[e for f in [F.min, F.max, F.avg]
                   for e in (F.lit(f.__name__), f(c))]).alias(c)
    for c in columns
]
df.agg(*cols_to_agg).show(truncate=False)
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|column1 |column2 |column3 |column4 |
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|{min -> 4.4284, max -> 6.5076, avg -> 5.117271428571429}|{min -> 0.5438, max -> 1.0199, avg -> 0.7560857142857144}|{min -> 5.1013, max -> 7.9504, avg -> 6.271871428571428}|{min -> 0.1547, max -> 1.9052, avg -> 0.9921571428571428}|
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
which can then be collected to a Python dictionary:
df.agg(*cols_to_agg).collect()[0].asDict()
{'column1': {'avg': 5.117271428571429, 'max': 6.5076, 'min': 4.4284},
'column2': {'avg': 0.7560857142857144, 'max': 1.0199, 'min': 0.5438},
'column3': {'avg': 6.271871428571428, 'max': 7.9504, 'min': 5.1013},
'column4': {'avg': 0.9921571428571428, 'max': 1.9052, 'min': 0.1547}}
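Individual values can then be looked up by column name and metric; for example (stats is a hypothetical variable holding the collected dict):

stats = df.agg(*cols_to_agg).collect()[0].asDict()
print(stats['column1']['min'])  # 4.4284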