How to calculate the minimum, maximum, and average values for each column in a dataset using MapReduce in PySpark?
Assuming I have a large dataset, here is an abbreviated part of it:
Status,column1,column2,column3,column4
Healthy,4.5044,0.7443,6.34,1.9052
Patient,4.4284,0.9073,5.6433,1.6232
Patient,4.5291,1.0199,6.113,1.0565
Healthy,5.2258,0.6125,7.9504,0.1547
Healthy,4.8834,0.5786,5.6021,0.5942
Patient,5.7422,0.8862,5.1013,0.9402
Healthy,6.5076,0.5438,7.153,0.6711
I know the easiest way to do this is to use df.describe().show() in PySpark, but how can I use MapReduce in PySpark to calculate the minimum, maximum, and average of each column?
df.describe() uses native Spark functions to run the computation. You can explicitly use select expressions to get the same results.
from pyspark.sql import functions as F
data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))
# build one min/max/avg expression per column
select_expr = []
for c in df.columns:
    for metric in [F.min, F.max, F.avg]:
        select_expr.append(metric(c))
df.select(*select_expr).show()
"""
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(Status)|max(Status)|avg(Status)|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
| Healthy| Patient| null| 4.4284| 6.5076|5.117271428571429| 0.5438| 1.0199|0.7560857142857141| 5.1013| 7.9504|6.271871428571429| 0.1547| 1.9052|0.9921571428571428|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
"""
It helps if you specify the output you want in your question, or what you'll be using the output for, but the below should cover most use cases.
from pyspark.sql import functions as F
data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))
# aggregate every column except the label column
columns = df.columns
columns.remove('Status')
cols_to_agg = [f(c) for c in columns for f in [F.min, F.max, F.avg]]
df.agg(*cols_to_agg).show()
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
| 4.4284| 6.5076|5.117271428571429| 0.5438| 1.0199|0.7560857142857144| 5.1013| 7.9504|6.271871428571428| 0.1547| 1.9052|0.9921571428571428|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
Or, as Igor mentioned, you could do a groupBy to get a more granular breakdown:
df.groupBy('status').agg(*cols_to_agg)
Or, if you want both, use a rollup, as this will give the result of both of the above in a single aggregation and output:
df.rollup('status').agg(*cols_to_agg+[F.grouping_id()]).show()
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
| status|min(column1)|max(column1)| avg(column1)|min(column2)|max(column2)| avg(column2)|min(column3)|max(column3)| avg(column3)|min(column4)|max(column4)| avg(column4)|grouping_id()|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
|Patient| 4.4284| 5.7422| 4.8999| 0.8862| 1.0199|0.9378000000000001| 5.1013| 6.113|5.619199999999999| 0.9402| 1.6232|1.2066333333333332| 0|
|Healthy| 4.5044| 6.5076|5.2802999999999995| 0.5438| 0.7443| 0.6198| 5.6021| 7.9504| 6.761375| 0.1547| 1.9052|0.8312999999999999| 0|
| null| 4.4284| 6.5076| 5.117271428571429| 0.5438| 1.0199|0.7560857142857144| 5.1013| 7.9504|6.271871428571428| 0.1547| 1.9052|0.9921571428571428| 1|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
Note that df.agg is just an alias for df.groupBy().agg(), and grouping_id is the null-safe way to indicate which aggregation level a row belongs to.
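For example, the overall row can be separated from the per-status rows by filtering on it (the gid alias here is just for illustration):

rolled = df.rollup('status').agg(*cols_to_agg + [F.grouping_id().alias('gid')])
rolled.where('gid = 0').show()  # per-status rows only
rolled.where('gid = 1').show()  # overall row only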
Lastly, you could consider putting the outputs in a friendlier format, such as a map column, which outputs the below:
# one map column per input column, keyed by metric name;
# F.lit() makes the metric name a literal key rather than a column reference
cols_to_agg = [
    F.create_map(*[e for f in [F.min, F.max, F.avg]
                   for e in (F.lit(f.__name__), f(c))]).alias(c)
    for c in columns
]
df.agg(*cols_to_agg).show(truncate=False)
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|column1 |column2 |column3 |column4 |
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|{min -> 4.4284, max -> 6.5076, avg -> 5.117271428571429}|{min -> 0.5438, max -> 1.0199, avg -> 0.7560857142857144}|{min -> 5.1013, max -> 7.9504, avg -> 6.271871428571428}|{min -> 0.1547, max -> 1.9052, avg -> 0.9921571428571428}|
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
which can then be collected to a Python dictionary:
df.agg(*cols_to_agg).collect()[0].asDict()
{'column1': {'avg': 5.117271428571429, 'max': 6.5076, 'min': 4.4284},
'column2': {'avg': 0.7560857142857144, 'max': 1.0199, 'min': 0.5438},
'column3': {'avg': 6.271871428571428, 'max': 7.9504, 'min': 5.1013},
'column4': {'avg': 0.9921571428571428, 'max': 1.9052, 'min': 0.1547}}
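Individual values can then be looked up by column name and metric; for example (stats is a hypothetical variable holding the collected dict):

stats = df.agg(*cols_to_agg).collect()[0].asDict()
print(stats['column1']['min'])  # 4.4284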