Assume I have a large data set; here is an abbreviated part of it:
Status,column1,column2,column3,column4
Healthy,4.5044,0.7443,6.34,1.9052
Patient,4.4284,0.9073,5.6433,1.6232
Patient,4.5291,1.0199,6.113,1.0565
Healthy,5.2258,0.6125,7.9504,0.1547
Healthy,4.8834,0.5786,5.6021,0.5942
Patient,5.7422,0.8862,5.1013,0.9402
Healthy,6.5076,0.5438,7.153,0.6711
I know the easiest way to do this is to use df.describe().show() in PySpark, but how can I use MapReduce in PySpark to calculate the minimum, maximum, and average of each column?
df.describe()
uses native Spark functions under the hood to run the computation. You can explicitly build select expressions to get the same results:
from pyspark.sql import functions as F
data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))
select_expr = []
for c in df.columns:
    for metric in [F.min, F.max, F.avg]:
        select_expr.append(metric(c))
df.select(*select_expr).show()
"""
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(Status)|max(Status)|avg(Status)|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
| Healthy| Patient| null| 4.4284| 6.5076|5.117271428571429| 0.5438| 1.0199|0.7560857142857141| 5.1013| 7.9504|6.271871428571429| 0.1547| 1.9052|0.9921571428571428|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
"""
It helps if you specify the output you want in your question, or what you'll be using it for, but the following should cover most use cases.
from pyspark.sql import functions as F
data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))
columns = df.columns
columns.remove('Status')
cols_to_agg = [f(c) for c in columns for f in [F.min, F.max, F.avg]]
df.agg(*cols_to_agg).show()
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
| 4.4284| 6.5076|5.117271428571429| 0.5438| 1.0199|0.7560857142857144| 5.1013| 7.9504|6.271871428571428| 0.1547| 1.9052|0.9921571428571428|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
Or, as Igor mentioned, you can do a groupBy to get a more granular breakdown:
df.groupBy('status').agg(*cols_to_agg).show()
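In map/reduce terms, the groupBy version is a reduceByKey-style shuffle: fold each value into its group's (min, max, sum, count) accumulator, then finalize. A pure-Python sketch of that pattern, for Status and column1 alone (the other columns work the same way):

```python
# (Status, column1) pairs from the sample data above.
rows = [
    ("Healthy", 4.5044), ("Patient", 4.4284), ("Patient", 4.5291),
    ("Healthy", 5.2258), ("Healthy", 4.8834), ("Patient", 5.7422),
    ("Healthy", 6.5076),
]

# Combine step: fold each value into its key's (min, max, sum, count).
acc = {}
for status, v in rows:
    mn, mx, s, n = acc.get(status, (v, v, 0.0, 0))
    acc[status] = (min(mn, v), max(mx, v), s + v, n + 1)

# Finalize: turn each accumulator into the reported statistics.
per_group = {k: {"min": mn, "max": mx, "avg": s / n}
             for k, (mn, mx, s, n) in acc.items()}
```

On an RDD the equivalent would be keying each row by Status and merging the accumulators with reduceByKey, which is what groupBy().agg() does for you under the hood.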
Or, if you want both, use a rollup, which gives the results of both of the above in a single aggregation and output:
df.rollup('status').agg(*cols_to_agg+[F.grouping_id()]).show()
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
| status|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|grouping_id()|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
|Patient| 4.4284| 5.7422| 4.8999| 0.8862| 1.0199|0.9378000000000001| 5.1013| 6.113|5.619199999999999| 0.9402| 1.6232|1.2066333333333332| 0|
|Healthy| 4.5044| 6.5076|5.2802999999999995| 0.5438| 0.7443| 0.6198| 5.6021| 7.9504| 6.761375| 0.1547| 1.9052|0.8312999999999999| 0|
| null| 4.4284| 6.5076| 5.117271428571429| 0.5438| 1.0199|0.7560857142857144| 5.1013| 7.9504|6.271871428571428| 0.1547| 1.9052|0.9921571428571428| 1|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
Note that df.agg is just an alias for df.groupBy().agg(), and grouping_id() is the null-safe way to tell which aggregation level a row belongs to (a literal null in Status would otherwise be indistinguishable from the grand-total row).
Lastly, you could consider putting the output in a friendlier format, such as a map column, which produces the following:
# map keys must be Columns, hence F.lit; alias each map to its source column
cols_to_agg = [
    F.create_map(*[m for f in [F.min, F.max, F.avg]
                   for m in (F.lit(f.__name__), f(c))]).alias(c)
    for c in columns
]
df.agg(*cols_to_agg).show(truncate=False)
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|column1 |column2 |column3 |column4 |
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|{min -> 4.4284, max -> 6.5076, avg -> 5.117271428571429}|{min -> 0.5438, max -> 1.0199, avg -> 0.7560857142857144}|{min -> 5.1013, max -> 7.9504, avg -> 6.271871428571428}|{min -> 0.1547, max -> 1.9052, avg -> 0.9921571428571428}|
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
which can then be collected into a Python dictionary:
df.agg(*cols_to_agg).collect()[0].asDict()
{'column1': {'avg': 5.117271428571429, 'max': 6.5076, 'min': 4.4284},
'column2': {'avg': 0.7560857142857144, 'max': 1.0199, 'min': 0.5438},
'column3': {'avg': 6.271871428571428, 'max': 7.9504, 'min': 5.1013},
'column4': {'avg': 0.9921571428571428, 'max': 1.9052, 'min': 0.1547}}