Arithmetic subtraction in group aggregation PySpark
I have the following dataframe:
ID  val1  val2  val3  ...
1   4     1     3     ...
1   5     4     8     ...
2   6     3     6     ...
2   9     2     2     ...
3   2     1     4     ...
3   1     1     4     ...
I need to group/aggregate by ID and subtract the values (first row minus last row within each group), producing the following output:
ID  val1  val2  val3  ...
1   -1    -3    -5    ...
2   -3    1     4     ...
3   1     0     0     ...
My current approach produces the desired output for one column at a time:
from pyspark.sql.functions import first, last, col
output = df.groupBy('id').agg((first(col('val1')) - last(col('val1'))).alias('val1'))
However, my data set has numerous columns, and I need a clean way to do this for all of them.
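For reference, a minimal sketch that rebuilds this sample frame (assuming an active SparkSession named spark):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [(1, 4, 1, 3), (1, 5, 4, 8), (2, 6, 3, 6),
     (2, 9, 2, 2), (3, 2, 1, 4), (3, 1, 1, 4)],
    ['id', 'val1', 'val2', 'val3'])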
Check the code below.
(df
    .groupBy(col("id"))
    .agg(
        (first(col("val1")) - last(col("val1"))).alias("val1"),
        (first(col("val2")) - last(col("val2"))).alias("val2"),
        (first(col("val3")) - last(col("val3"))).alias("val3"),
    )
    .orderBy(col("id"), ascending=True)
    .show(truncate=False))
+---+----+----+----+
|id |val1|val2|val3|
+---+----+----+----+
|1 |-1 |-3 |-5 |
|2 |-3 |1 |4 |
|3 |1 |0 |0 |
+---+----+----+----+
To avoid writing one aggregate per column, build the aggregate expressions programmatically over every column except id:

aggCols = [(first(col(c)) - last(col(c))).alias(c) for c in df.columns if c != "id"]
df.groupBy(col("id")).agg(*aggCols).show()
+---+----+----+----+
| id|val1|val2|val3|
+---+----+----+----+
| 1| -1| -3| -5|
| 3| 1| 0| 0|
| 2| -3| 1| 4|
+---+----+----+----+
Let's register a UDF and use numpy's .ptp. Note that np.ptp returns max - min (the peak-to-peak spread), so the result is the absolute difference and the signs differ from the first - last output above.
import numpy as np
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
c = udf(lambda x: float(np.ptp(x)), FloatType())  # register udf: np.ptp = max - min of the collected list
df.groupBy('id').agg(c(F.collect_list('val1')).alias('v1'),
                     c(F.collect_list('val2')).alias('v2'),
                     c(F.collect_list('val3')).alias('v3')).show()  # apply udf
+---+---+---+---+
| id| v1| v2| v3|
+---+---+---+---+
| 1|1.0|3.0|5.0|
| 2|3.0|1.0|4.0|
| 3|1.0|0.0|0.0|
+---+---+---+---+
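The same max - min spread can also be computed with built-in aggregates, which avoids the Python UDF and collect_list overhead; a minimal sketch over the same columns:

from pyspark.sql import functions as F

spreadCols = [(F.max(c) - F.min(c)).alias(c) for c in df.columns if c != 'id']
df.groupBy('id').agg(*spreadCols).show()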