PySpark: iterate over a DataFrame by group with a lookup on the previous row

Question

Please help, as I am new to this. Below is my DataFrame:

type col1 col2 col3
1    0    41   0
1    27   0    0
1    1    0    0 
1    183  0    2
2    null 0    0
2    null 10   0
3    0    126  0
3    2    0    1
3    4    0    0
3    5    0    0

Below is the output I expect:

type col1 col2 col3 result
1    0    41   0    0
1    27   0    0    14
1    1    0    0    13
1    183  0    2    -168
2    null 0    0
2    null 10   0
3    0    126  0    0
3    2    0    1    125
3    4    0    0    121
3    5    0    0    116

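The challenge is that this formula has to be applied within each group of the type column, like prev(col2) - col1 + col3.

I tried using a window with the lag function on col2 to populate the result column, but it does not work.

Below is my code: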

part = Window().partitionBy().orderBy('type')
DF = DF.withColumn('result',lag("col2").over(w)-DF.col1+DF.col3)

I am now struggling to do this with a map function instead. Please help.

Answer 1

The logic is a bit tricky and complicated.

You can do the following in PySpark or Scala.

PySpark

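from pyspark.sql import functions as F
from pyspark.sql import Window

# Caution: ordering by the partition column itself does not pin down the row
# order inside a group; order by a real sequence column if one is available.
windowSpec = Window.partitionBy("type").orderBy("type")
# 1) Per-row term: prev(col2) - col1 + col3
df = df.withColumn('result', F.lag(df.col2, 1).over(windowSpec) - df.col1 + df.col3)
# 2) Replace null terms (first row of a group, null col1) with 0
df = df.withColumn('result', F.when(df.result.isNull(), F.lit(0)).otherwise(df.result))
# 3) Running total: sum of all earlier terms in the group plus the current term
df = df.withColumn('result', F.sum(df.result).over(windowSpec.rowsBetween(Window.unboundedPreceding, -1)) + df.result)
# 4) The first row of each group has no earlier rows, so its sum is null: set it to 0
df = df.withColumn('result', F.when(df.result.isNull(), F.lit(0)).otherwise(df.result))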

Scala

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions._
val windowSpec = Window.partitionBy("type").orderBy("type")
df.withColumn("result", lag("col2", 1).over(windowSpec) - $"col1"+$"col3")
  .withColumn("result", when($"result".isNull, lit(0)).otherwise($"result"))
  .withColumn("result", sum("result").over(windowSpec.rowsBetween(Long.MinValue, -1)) +$"result")
  .withColumn("result", when($"result".isNull, lit(0)).otherwise($"result"))

You should get the following result:

+----+----+----+----+------+
|type|col1|col2|col3|result|
+----+----+----+----+------+
|1   |0   |41  |0   |0.0   |
|1   |27  |0   |0   |14.0  |
|1   |1   |0   |0   |13.0  |
|1   |183 |0   |2   |-168.0|
|3   |0   |126 |0   |0.0   |
|3   |2   |0   |1   |125.0 |
|3   |4   |0   |0   |121.0 |
|3   |5   |0   |0   |116.0 |
|2   |null|0   |0   |0.0   |
|2   |null|10  |0   |0.0   |
+----+----+----+----+------+

Edited

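The first withColumn applies the formula prev(col2) - col1 + col3. The second withColumn changes the nulls in the result column to 0. The third withColumn computes a cumulative sum, i.e. it adds up all values of the result column up to the current row. Taken together, the first three withColumn calls are therefore equivalent to prev(col2) + prev(result) - col1 + col3. The final withColumn changes the remaining null values in the result column to 0.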
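To make the equivalence concrete, here is a minimal plain-Python sketch (no Spark involved) that replays the same steps on the rows from the question. The names `rows`, `group_results` and `results` are made up for this illustration, `None` stands in for null, and the rows are assumed to be processed in the order they are listed in the question.

```python
from itertools import accumulate

# Rows from the question as (type, col1, col2, col3); None stands for null.
rows = [
    (1, 0, 41, 0), (1, 27, 0, 0), (1, 1, 0, 0), (1, 183, 0, 2),
    (2, None, 0, 0), (2, None, 10, 0),
    (3, 0, 126, 0), (3, 2, 0, 1), (3, 4, 0, 0), (3, 5, 0, 0),
]

def group_results(group):
    """Running total of prev(col2) - col1 + col3 within one group."""
    terms = []
    prev_col2 = None  # like lag(col2): null on the first row of a group
    for _, col1, col2, col3 in group:
        if prev_col2 is None or col1 is None or col3 is None:
            terms.append(0)  # a null term is replaced by 0
        else:
            terms.append(prev_col2 - col1 + col3)
        prev_col2 = col2
    # Cumulative sum including the current row, which is what
    # "sum of earlier rows + current row" amounts to.
    return list(accumulate(terms))

# Hypothetical helper structure: results per type, in listed row order.
results = {
    t: group_results([r for r in rows if r[0] == t])
    for t in sorted({r[0] for r in rows})
}

for t, values in results.items():
    print(t, values)
```

Each value after the first in a group equals the previous result plus prev(col2) - col1 + col3, which is why a lag followed by a running sum reproduces the expected output without iterating row by row in Spark.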
