Arithmetic subtraction in group aggregation PySpark
I have the following dataframe:
ID  val1  val2  val3  ...
1   4     1     3     ...
1   5     4     8     ...
2   6     3     6     ...
2   9     2     2     ...
3   2     1     4     ...
3   1     1     4     ...
I need to group/aggregate by ID and subtract the values (first row minus last row within each group), producing the following output:
ID  val1  val2  val3  ...
1   -1    -3    -5    ...
2   -3    1     4     ...
3   1     0     0     ...
My current approach produces the desired output for one column at a time:
from pyspark.sql.functions import first, last, col
output = df.groupBy('id').agg((first(col('val1')) - last(col('val1'))).alias('val1'))
However, my data set has numerous columns, and I need a clean way to do this for all of them.
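For reference, a minimal sketch that rebuilds this sample frame (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 4, 1, 3), (1, 5, 4, 8), (2, 6, 3, 6),
     (2, 9, 2, 2), (3, 2, 1, 4), (3, 1, 1, 4)],
    ['id', 'val1', 'val2', 'val3'])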
Check the code below.
(df
    .groupBy(col("id"))
    .agg(
        (first(col("val1")) - last(col("val1"))).alias("val1"),
        (first(col("val2")) - last(col("val2"))).alias("val2"),
        (first(col("val3")) - last(col("val3"))).alias("val3"),
    )
    .orderBy(col("id"), ascending=True)
    .show(truncate=False))
+---+----+----+----+
|id |val1|val2|val3|
+---+----+----+----+
|1 |-1 |-3 |-5 |
|2 |-3 |1 |4 |
|3 |1 |0 |0 |
+---+----+----+----+
To avoid writing one aggregate per column, build the aggregate expressions programmatically over every column except id:

aggCols = [(first(col(c)) - last(col(c))).alias(c) for c in df.columns if c != "id"]
df.groupBy(col("id")).agg(*aggCols).show()
+---+----+----+----+
| id|val1|val2|val3|
+---+----+----+----+
| 1| -1| -3| -5|
| 3| 1| 0| 0|
| 2| -3| 1| 4|
+---+----+----+----+
Let's register a UDF and use numpy's .ptp. Note that np.ptp returns max - min (the peak-to-peak spread), so the result is the absolute difference and the signs differ from the first - last output above.
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
c = udf(lambda x: float(np.ptp(x)), FloatType())  # register udf: np.ptp = max - min of the collected list
df.groupBy('id').agg(c(F.collect_list('val1')).alias('v1'),
                     c(F.collect_list('val2')).alias('v2'),
                     c(F.collect_list('val3')).alias('v3')).show()  # apply udf
+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
| 1|1.0|3.0|5.0|
| 2|3.0|1.0|4.0|
| 3|1.0|0.0|0.0|
+---+---+---+---+
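The same max - min spread can also be computed with built-in aggregates, which avoids the Python UDF and collect_list overhead; a minimal sketch over the same columns:

from pyspark.sql import functions as F

spreadCols = [(F.max(c) - F.min(c)).alias(c) for c in df.columns if c != 'id']
df.groupBy('id').agg(*spreadCols).show()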