
How to calculate the minimum, maximum, and average values for each column in a dataset using MapReduce in PySpark?

Assuming I have a large dataset, here is an abbreviated part of it:

Status,column1,column2,column3,column4
Healthy,4.5044,0.7443,6.34,1.9052
Patient,4.4284,0.9073,5.6433,1.6232
Patient,4.5291,1.0199,6.113,1.0565
Healthy,5.2258,0.6125,7.9504,0.1547
Healthy,4.8834,0.5786,5.6021,0.5942
Patient,5.7422,0.8862,5.1013,0.9402
Healthy,6.5076,0.5438,7.153,0.6711

I know the easiest way to do this is to use df.describe().show() in PySpark, but how can I use MapReduce in PySpark to calculate the minimum, maximum, and average of each column?

df.describe() uses native Spark functions to run the computation. You can explicitly use select expressions to get the same results.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
        ("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
        ("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
        ("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
        ("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
        ("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
        ("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))

# Build a min/max/avg expression for every column (including Status)
select_expr = []

for c in df.columns:
    for metric in [F.min, F.max, F.avg]:
        select_expr.append(metric(c))

df.select(*select_expr).show()

"""
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(Status)|max(Status)|avg(Status)|min(column1)|max(column1)|     avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|    Healthy|    Patient|       null|      4.4284|      6.5076|5.117271428571429|      0.5438|      1.0199|0.7560857142857141|      5.1013|      7.9504|6.271871428571429|      0.1547|      1.9052|0.9921571428571428|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
"""

Just group by the column you want and use aggregate functions to find the max, min, and avg:

import pyspark.sql.functions as F

df.groupBy(df.Status) \
  .agg(F.max(df.column1), F.min(df.column1), F.avg(df.column1)) \
  .show()

[screenshot of the output: max, min and avg of column1 for each Status group]

It helps if you specify in your question what output you want or what you'll be using it for, but the options below should cover most use cases.

from pyspark.sql import functions as F

data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
        ("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
        ("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
        ("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
        ("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
        ("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
        ("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))

# Status is a string column, so exclude it from the numeric aggregations
columns = df.columns
columns.remove('Status')

cols_to_agg = [f(c) for c in columns for f in [F.min, F.max, F.avg]]

df.agg(*cols_to_agg).show()

+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(column1)|max(column1)|     avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|      4.4284|      6.5076|5.117271428571429|      0.5438|      1.0199|0.7560857142857144|      5.1013|      7.9504|6.271871428571428|      0.1547|      1.9052|0.9921571428571428|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+

Or, as Igor mentioned, you could do a groupBy to get a more granular breakdown:

df.groupBy('status').agg(*cols_to_agg).show()

Or, if you want both, use a rollup, since it gives the result of both of the above in a single aggregation and output:

df.rollup('status').agg(*cols_to_agg+[F.grouping_id()]).show()
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
| status|min(column1)|max(column1)|      avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|grouping_id()|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
|Patient|      4.4284|      5.7422|            4.8999|      0.8862|      1.0199|0.9378000000000001|      5.1013|       6.113|5.619199999999999|      0.9402|      1.6232|1.2066333333333332|            0|
|Healthy|      4.5044|      6.5076|5.2802999999999995|      0.5438|      0.7443|            0.6198|      5.6021|      7.9504|         6.761375|      0.1547|      1.9052|0.8312999999999999|            0|
|   null|      4.4284|      6.5076| 5.117271428571429|      0.5438|      1.0199|0.7560857142857144|      5.1013|      7.9504|6.271871428571428|      0.1547|      1.9052|0.9921571428571428|            1|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+

Note that df.agg is just an alias for df.groupBy().agg(), and grouping_id is the null-safe way to tell which aggregation level a row belongs to (a real null in status would otherwise be indistinguishable from the grand-total row).
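
For example, here is a hypothetical follow-up (reusing cols_to_agg from above) that keeps only the grand-total row by filtering on grouping_id() instead of on status IS NULL:

# keep only the grand-total row of the rollup (grouping_id() == 1)
totals = (df.rollup('status')
            .agg(*cols_to_agg, F.grouping_id().alias('gid'))
            .filter(F.col('gid') == 1)
            .drop('gid'))
totals.show()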

Lastly, you could consider putting the outputs in a friendlier format, such as one map column per input column, which produces the below:

# wrap each column's min/max/avg in a single map column, keyed by the metric name
cols_to_agg = [F.create_map(*[e for f in [F.min, F.max, F.avg] for e in (F.lit(f.__name__), f(c))]).alias(c)
               for c in columns]
df.agg(*cols_to_agg).show(truncate=False)


+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|column1                                                 |column2                                                  |column3                                                 |column4                                                  |
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|{min -> 4.4284, max -> 6.5076, avg -> 5.117271428571429}|{min -> 0.5438, max -> 1.0199, avg -> 0.7560857142857144}|{min -> 5.1013, max -> 7.9504, avg -> 6.271871428571428}|{min -> 0.1547, max -> 1.9052, avg -> 0.9921571428571428}|
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+

which can then be collected to a Python dictionary:

df.agg(*cols_to_agg).collect()[0].asDict()


{'column1': {'avg': 5.117271428571429, 'max': 6.5076, 'min': 4.4284},
 'column2': {'avg': 0.7560857142857144, 'max': 1.0199, 'min': 0.5438},
 'column3': {'avg': 6.271871428571428, 'max': 7.9504, 'min': 5.1013},
 'column4': {'avg': 0.9921571428571428, 'max': 1.9052, 'min': 0.1547}}
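
As a small usage sketch (the variable name stats is just for illustration), individual statistics can then be looked up straight from that dictionary:

stats = df.agg(*cols_to_agg).collect()[0].asDict()
print(stats['column1']['avg'])   # 5.117271428571429
print(stats['column4']['max'])   # 1.9052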
