Drop values below the cummax across multiple Spark DataFrame columns in Scala

Question

I have a DataFrame like the one shown below. There are more than 100 signals, so the DataFrame has more than 100 columns.

+---+----------+--------+--------+--------+
|id |date      |signal01|signal02|signal03| ......
+---+----------+--------+--------+--------+
|050|2021-01-14|1       |3       |1       |
|050|2021-01-15|null    |4       |2       |
|050|2021-02-02|2       |3       |3       |
|051|2021-01-14|1       |3       |0       |
|051|2021-01-15|2       |null    |null    |
|051|2021-02-02|3       |3       |2       |
|051|2021-02-03|1       |3       |1       |
|052|2021-03-03|1       |3       |0       |
|052|2021-03-05|3       |3       |null    |
|052|2021-03-06|2       |null    |2       |
|052|2021-03-16|3       |5       |5       | ......
+---+----------+--------+--------+--------+

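I have to find the cummax of each signal, compare it with the corresponding signal column, and remove the signal values that are below the cummax, as well as the nulls.

Step 1. Find the cumulative max of each signal with respect to the id column.

Step 2. Delete the records whose value is lower than the cummax for each signal.

Step 3. Count the records whose cummax is less than the signal value (nulls excluded) for each signal.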
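To make the intended semantics concrete, here is a minimal plain-Scala sketch (no Spark) of the three steps for a single signal within one id, with rows already sorted by date. `None` stands in for null. The helper names `cumMax` and `dropBelowCumMax` are made up for illustration, and reading step 3 as "count the values that survive" is an assumption, not something the question states.

```scala
// Running maximum that ignores missing values (None plays the role of null).
def cumMax(values: Seq[Option[Int]]): Seq[Option[Int]] =
  values.scanLeft(Option.empty[Int]) { (running, v) =>
    (running, v) match {
      case (Some(r), Some(x)) => Some(math.max(r, x))
      case (r, None)          => r // a null does not change the running max
      case (None, x)          => x // first non-null value seen so far
    }
  }.tail // scanLeft emits the initial state first; drop it

// Step 2: keep a value only if it is not below the running max at its row.
def dropBelowCumMax(values: Seq[Option[Int]]): Seq[Option[Int]] =
  values.zip(cumMax(values)).map {
    case (Some(x), Some(m)) if x >= m => Some(x)
    case _                            => None
  }

// signal01 for id 051 from the sample data: 1, 2, 3, 1
val signal01For051: Seq[Option[Int]] = Seq(Some(1), Some(2), Some(3), Some(1))
val kept = dropBelowCumMax(signal01For051)

// Step 3 (assumed reading): number of values that survive.
val keptCount = kept.count(_.isDefined)
```

For this input the last value (1) falls below the running max (3) and is dropped, which matches the signal01 column for id 051 in the expected output.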

After counting, the final output should look like this.

+---+----------+--------+--------+--------+
|id |date      |signal01|signal02|signal03| ......
+---+----------+--------+--------+--------+
|050|2021-01-14|1       |3       |1       |
|050|2021-01-15|null    |null    |2       |
|050|2021-02-02|2       |3       |3       |
|051|2021-01-14|1       |3       |0       |
|051|2021-01-15|2       |null    |null    |
|051|2021-02-02|3       |3       |2       |
|051|2021-02-03|null    |3       |null    |
|052|2021-03-03|1       |3       |0       |
|052|2021-03-05|3       |3       |null    |
|052|2021-03-06|null    |null    |2       |
|052|2021-03-16|3       |5       |5       | ......
+---+----------+--------+--------+--------+

I tried using a window function as below, and it works for almost all records.

import org.apache.spark.sql.Column
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, max}
import scala.collection.mutable.ListBuffer

val w = Window.partitionBy("id").orderBy("date").rowsBetween(Window.unboundedPreceding, Window.currentRow)
val signalList01 = ListBuffer[Column]()
signalList01.append(col("id"), col("date"))
for (column <- signalColumns) {
  // Apply the max aggregate (which ignores nulls) over the window to each signal column
  signalList01 += (col(column), max(column).over(w).alias(column + "_cummax"))
}
val cumMaxDf = df.select(signalList01: _*)

However, I am getting wrong values for a few records.

Any idea how these wrong values end up in the cummax column? Any leads are appreciated!

Answer 1

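Just giving hints here (as you suggested) to help unblock you, but --WARNING-- I haven't tested the code!

The code you provided in the comments looks fine. It will give you your max column: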

val nw_df = original_df.withColumn("signal01_cummax", max(col("signal01")).over(windowCodedSO))

Now you need to be able to compare the two values in "signal01" and "signal01_cummax". A function like this, perhaps:

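// Boxed java.lang.Integer parameters are used so that null signals can actually be detected.
def takeOutRecordsLessThanCummax(signal: java.lang.Integer, signalCummax: java.lang.Integer): java.lang.Integer =
  if (signal == null || signalCummax == null || signal.intValue < signalCummax.intValue) null else signal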

Since we will be applying it to columns, we wrap it in a UDF:

val takeOutRecordsLessThanCummaxUDF: UserDefinedFunction = udf { (i: java.lang.Integer, j: java.lang.Integer) => takeOutRecordsLessThanCummax(i, j) }

Then you can combine all of the above so that it applies to your original DataFrame. Something like this could work:

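val signal_cummax_suffix = "_cummax"
// Fold over the signal columns only (assumes they share the "signal" prefix); id and date are left alone.
val signalColumns = original_df.columns.filter(_.startsWith("signal"))
val result = signalColumns.foldLeft(original_df)((dfac, colname) =>
  dfac
    .withColumn(colname.concat(signal_cummax_suffix), max(col(colname)).over(windowCodedSO))
    .withColumn(colname.concat("output"), takeOutRecordsLessThanCummaxUDF(col(colname), col(colname.concat(signal_cummax_suffix))))
)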
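As a possible alternative to the UDF, the comparison can be written with the built-in `when` function, which yields null when the condition is false or null. The following is an untested sketch meant for spark-shell or a script, under a few assumptions that do not come from the question: the signal columns are integers, their names start with "signal", the date strings sort chronologically, and step 3 means counting the values that survive. The names `cleaned` and `counts` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count, max, when}

val spark = SparkSession.builder().master("local[*]").appName("cummax-sketch").getOrCreate()
import spark.implicits._

// Rows for id 051 from the question, first two signals only.
val df = Seq[(String, String, Option[Int], Option[Int])](
  ("051", "2021-01-14", Some(1), Some(3)),
  ("051", "2021-01-15", Some(2), None),
  ("051", "2021-02-02", Some(3), Some(3)),
  ("051", "2021-02-03", Some(1), Some(3))
).toDF("id", "date", "signal01", "signal02")

// Assumption: every signal column name starts with "signal".
val signalColumns = df.columns.filter(_.startsWith("signal"))

val w = Window.partitionBy("id").orderBy("date")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

// Keep a value only when it is not below its running max; otherwise the result is null.
// A null signal makes the comparison null, so it stays null as well.
val cleanedCols = signalColumns.map { c =>
  when(col(c) >= max(col(c)).over(w), col(c)).alias(c)
}
val cleaned = df.select((Seq(col("id"), col("date")) ++ cleanedCols): _*)

// count(column) skips nulls, so this gives the number of surviving values per signal.
val counts = cleaned.select(signalColumns.map(c => count(col(c)).alias(c)): _*).first()
```

Building all expressions first and passing them to a single `select` keeps the query plan flat, which may matter with 100+ signal columns where chaining `withColumn` calls adds one projection per call.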

Solution 1
0 2021-07-07 18:35:22
