
How to calculate the minimum, maximum, and average values for each column in a dataset using MapReduce in PySpark?

Assuming I have a large dataset, here is an abbreviated part of it:

Status,column1,column2,column3,column4
Healthy,4.5044,0.7443,6.34,1.9052
Patient,4.4284,0.9073,5.6433,1.6232
Patient,4.5291,1.0199,6.113,1.0565
Healthy,5.2258,0.6125,7.9504,0.1547
Healthy,4.8834,0.5786,5.6021,0.5942
Patient,5.7422,0.8862,5.1013,0.9402
Healthy,6.5076,0.5438,7.153,0.6711

I know the easiest way to do this is to use df.describe().show() in PySpark, but how can I use MapReduce in PySpark to calculate the minimum, maximum, and average of each column?

df.describe() uses native Spark functions to run the computation. You can explicitly use select expressions to get the same results.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
        ("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
        ("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
        ("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
        ("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
        ("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
        ("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))

# Build a min/max/avg expression for every column (including Status)
select_expr = []

for c in df.columns:
    for metric in [F.min, F.max, F.avg]:
        select_expr.append(metric(c))

df.select(*select_expr).show()

"""
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(Status)|max(Status)|avg(Status)|min(column1)|max(column1)|     avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|    Healthy|    Patient|       null|      4.4284|      6.5076|5.117271428571429|      0.5438|      1.0199|0.7560857142857141|      5.1013|      7.9504|6.271871428571429|      0.1547|      1.9052|0.9921571428571428|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
"""

Just group by the column you want and use aggregate functions to find the max, min, and avg:

import pyspark.sql.functions as F

df.groupBy(df.Status) \
  .agg(F.max(df.column1), F.min(df.column1), F.avg(df.column1)) \
  .show()

[screenshot of the output: max, min and avg of column1 for each Status group]

It helps if you specify in your question what output you want or what you'll be using it for, but the options below should cover most use cases.

from pyspark.sql import functions as F

data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
        ("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
        ("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
        ("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
        ("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
        ("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
        ("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))

# Status is a string column, so exclude it from the numeric aggregations
columns = df.columns
columns.remove('Status')

cols_to_agg = [f(c) for c in columns for f in [F.min, F.max, F.avg]]

df.agg(*cols_to_agg).show()

+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(column1)|max(column1)|     avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|      4.4284|      6.5076|5.117271428571429|      0.5438|      1.0199|0.7560857142857144|      5.1013|      7.9504|6.271871428571428|      0.1547|      1.9052|0.9921571428571428|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+

Or, as Igor mentioned, you could do a groupBy to get a more granular breakdown:

df.groupBy('status').agg(*cols_to_agg).show()

Or, if you want both, use a rollup, since it gives the result of both of the above in a single aggregation and output:

df.rollup('status').agg(*cols_to_agg+[F.grouping_id()]).show()
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
| status|min(column1)|max(column1)|      avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|grouping_id()|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
|Patient|      4.4284|      5.7422|            4.8999|      0.8862|      1.0199|0.9378000000000001|      5.1013|       6.113|5.619199999999999|      0.9402|      1.6232|1.2066333333333332|            0|
|Healthy|      4.5044|      6.5076|5.2802999999999995|      0.5438|      0.7443|            0.6198|      5.6021|      7.9504|         6.761375|      0.1547|      1.9052|0.8312999999999999|            0|
|   null|      4.4284|      6.5076| 5.117271428571429|      0.5438|      1.0199|0.7560857142857144|      5.1013|      7.9504|6.271871428571428|      0.1547|      1.9052|0.9921571428571428|            1|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+

Note that df.agg is just an alias for df.groupBy().agg(), and grouping_id is the null-safe way to tell which aggregation level a row belongs to (a real null in status would otherwise be indistinguishable from the grand-total row).
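
For example, here is a hypothetical follow-up (reusing cols_to_agg from above) that keeps only the grand-total row by filtering on grouping_id() instead of on status IS NULL:

# keep only the grand-total row of the rollup (grouping_id() == 1)
totals = (df.rollup('status')
            .agg(*cols_to_agg, F.grouping_id().alias('gid'))
            .filter(F.col('gid') == 1)
            .drop('gid'))
totals.show()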

Lastly, you could consider putting the outputs in a friendlier format, such as one map column per input column, which produces the below:

# wrap each column's min/max/avg in a single map column, keyed by the metric name
cols_to_agg = [F.create_map(*[e for f in [F.min, F.max, F.avg] for e in (F.lit(f.__name__), f(c))]).alias(c)
               for c in columns]
df.agg(*cols_to_agg).show(truncate=False)


+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|column1                                                 |column2                                                  |column3                                                 |column4                                                  |
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|{min -> 4.4284, max -> 6.5076, avg -> 5.117271428571429}|{min -> 0.5438, max -> 1.0199, avg -> 0.7560857142857144}|{min -> 5.1013, max -> 7.9504, avg -> 6.271871428571428}|{min -> 0.1547, max -> 1.9052, avg -> 0.9921571428571428}|
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+

which can then be collected to a Python dictionary:

df.agg(*cols_to_agg).collect()[0].asDict()


{'column1': {'avg': 5.117271428571429, 'max': 6.5076, 'min': 4.4284},
 'column2': {'avg': 0.7560857142857144, 'max': 1.0199, 'min': 0.5438},
 'column3': {'avg': 6.271871428571428, 'max': 7.9504, 'min': 5.1013},
 'column4': {'avg': 0.9921571428571428, 'max': 1.9052, 'min': 0.1547}}
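
As a small usage sketch (the variable name stats is just for illustration), individual statistics can then be looked up straight from that dictionary:

stats = df.agg(*cols_to_agg).collect()[0].asDict()
print(stats['column1']['avg'])   # 5.117271428571429
print(stats['column4']['max'])   # 1.9052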
