
How to calculate the minimum, maximum and average values for each column in a dataset using MapReduce in pyspark?

Suppose I have a large dataset; here is an abridged portion of it:

Status,column1,column2,column3,column4
Healthy,4.5044,0.7443,6.34,1.9052
Patient,4.4284,0.9073,5.6433,1.6232
Patient,4.5291,1.0199,6.113,1.0565
Healthy,5.2258,0.6125,7.9504,0.1547
Healthy,4.8834,0.5786,5.6021,0.5942
Patient,5.7422,0.8862,5.1013,0.9402
Healthy,6.5076,0.5438,7.153,0.6711

I know the easiest way is to use df.describe().show() in pyspark, but how can I calculate the minimum, maximum and average values of each column using MapReduce in pyspark?

df.describe() uses native Spark functions to run the computation. You can use select expressions explicitly to get the same results.

from pyspark.sql import functions as F

data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
        ("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
        ("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
        ("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
        ("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
        ("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
        ("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))

# build a min/max/avg expression for every column; avg of the
# non-numeric Status column will come back as null
select_expr = []

for c in df.columns:
    for metric in [F.min, F.max, F.avg]:
        select_expr.append(metric(c))

df.select(*select_expr).show()

"""
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(Status)|max(Status)|avg(Status)|min(column1)|max(column1)|     avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|    Healthy|    Patient|       null|      4.4284|      6.5076|5.117271428571429|      0.5438|      1.0199|0.7560857142857141|      5.1013|      7.9504|6.271871428571429|      0.1547|      1.9052|0.9921571428571428|
+-----------+-----------+-----------+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
"""

Just group by the column for which you want to find the maximum, minimum and average values, then use aggregate functions:

import pyspark.sql.functions as F

df.groupBy(df.Status) \
  .agg(F.max(df.column1), F.min(df.column1), F.avg(df.column1)) \
  .show()
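This produces one row per Status value; for column1 the figures agree with the per-status rows of the rollup output shown in the last answer below (e.g. min 4.4284, max 5.7422, avg 4.8999 for Patient).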


It would help if you specified in the question what output you want, or what you are going to use the output for, but the following should cover most use cases.

from pyspark.sql import functions as F

data = [("Healthy", 4.5044, 0.7443, 6.34, 1.9052,),
        ("Patient", 4.4284, 0.9073, 5.6433, 1.6232,),
        ("Patient", 4.5291, 1.0199, 6.113, 1.0565,),
        ("Healthy", 5.2258, 0.6125, 7.9504, 0.1547,),
        ("Healthy", 4.8834, 0.5786, 5.6021, 0.5942,),
        ("Patient", 5.7422, 0.8862, 5.1013, 0.9402,),
        ("Healthy", 6.5076, 0.5438, 7.153, 0.6711,), ]
df = spark.createDataFrame(data, ("Status", "column1", "column2", "column3", "column4", ))

# aggregate over every column except the non-numeric Status
columns = df.columns
columns.remove('Status')

cols_to_agg = [f(c) for c in columns for f in [F.min, F.max, F.avg]]

df.agg(*cols_to_agg).show()

+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|min(column1)|max(column1)|     avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+
|      4.4284|      6.5076|5.117271428571429|      0.5438|      1.0199|0.7560857142857144|      5.1013|      7.9504|6.271871428571428|      0.1547|      1.9052|0.9921571428571428|
+------------+------------+-----------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+

Or, as Igor mentioned, you can do a groupBy to get a more granular breakdown:

df.groupBy('status').agg(*cols_to_agg).show()
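Its rows correspond to the grouping_id() == 0 rows of the rollup output below.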

Alternatively, use rollup if you want both, since it gives the results of both of the above in a single aggregation and output:

df.rollup('status').agg(*cols_to_agg+[F.grouping_id()]).show()
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
| status|min(column1)|max(column1)|      avg(column1)|min(column2)|max(column2)|      avg(column2)|min(column3)|max(column3)|     avg(column3)|min(column4)|max(column4)|      avg(column4)|grouping_id()|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+
|Patient|      4.4284|      5.7422|            4.8999|      0.8862|      1.0199|0.9378000000000001|      5.1013|       6.113|5.619199999999999|      0.9402|      1.6232|1.2066333333333332|            0|
|Healthy|      4.5044|      6.5076|5.2802999999999995|      0.5438|      0.7443|            0.6198|      5.6021|      7.9504|         6.761375|      0.1547|      1.9052|0.8312999999999999|            0|
|   null|      4.4284|      6.5076| 5.117271428571429|      0.5438|      1.0199|0.7560857142857144|      5.1013|      7.9504|6.271871428571428|      0.1547|      1.9052|0.9921571428571428|            1|
+-------+------------+------------+------------------+------------+------------+------------------+------------+------------+-----------------+------------+------------+------------------+-------------+

Note that df.agg is just an alias for df.groupBy().agg(), and grouping_id is a null-safe way to indicate which aggregation level a row belongs to.
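For example, to split the rollup result back apart you can filter on it. A small sketch (the gid alias is illustrative):

# gid = 1 marks the rollup's grand-total row, gid = 0 the per-status rows;
# unlike "status IS NULL", this also works if the data itself contains
# null Status values
rolled = df.rollup('status').agg(*cols_to_agg, F.grouping_id().alias('gid'))
overall = rolled.where(F.col('gid') == 1)
per_status = rolled.where(F.col('gid') == 0)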

Finally, you might consider putting the output into a friendlier format, such as map columns, with output like the following:

cols_to_agg = [F.create_map(F.lit("min"), F.min(c), F.lit("max"), F.max(c),
                            F.lit("avg"), F.avg(c)).alias(c)
               for c in columns]
df.agg(*cols_to_agg).show(truncate=False)


+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|column1                                                 |column2                                                  |column3                                                 |column4                                                  |
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+
|{min -> 4.4284, max -> 6.5076, avg -> 5.117271428571429}|{min -> 0.5438, max -> 1.0199, avg -> 0.7560857142857144}|{min -> 5.1013, max -> 7.9504, avg -> 6.271871428571428}|{min -> 0.1547, max -> 1.9052, avg -> 0.9921571428571428}|
+--------------------------------------------------------+---------------------------------------------------------+--------------------------------------------------------+---------------------------------------------------------+

This can then be collected into a Python dictionary:

df.agg(*cols_to_agg).collect()[0].asDict()


{'column1': {'avg': 5.117271428571429, 'max': 6.5076, 'min': 4.4284},
 'column2': {'avg': 0.7560857142857144, 'max': 1.0199, 'min': 0.5438},
 'column3': {'avg': 6.271871428571428, 'max': 7.9504, 'min': 5.1013},
 'column4': {'avg': 0.9921571428571428, 'max': 1.9052, 'min': 0.1547}}
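If the dictionary above is bound to a variable, say stats, the individual figures are then plain Python floats, e.g. stats['column3']['avg'].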
