

Spark Dataframe/ Dataset: Generic Conditional cumulative sum

I have a dataframe with a few attributes (c1, c2), an offset (in days) and a few values (v1, v2).

val inputDF= spark.sparkContext.parallelize(Seq((1,2,30, 100, -1),(1,2,30, 100, 0), (1,2,30, 100, 1),(11,21,30, 100, -1),(11,21,30, 100, 0), (11,21,30, 100, 1)), 10).toDF("c1", "c2", "v1", "v2", "offset")
inputDF: org.apache.spark.sql.DataFrame = [c1: int, c2: int ... 3 more fields]

scala> inputDF.show
+---+---+---+---+------+
| c1| c2| v1| v2|offset|
+---+---+---+---+------+
|  1|  2| 30|100|    -1|
|  1|  2| 30|100|     0|
|  1|  2| 30|100|     1|
| 11| 21| 30|100|    -1|
| 11| 21| 30|100|     0|
| 11| 21| 30|100|     1|
+---+---+---+---+------+

What I need to do is calculate the cumulative sum of v1 and v2 for each (c1, c2) across offset.

I tried this, but it's far from a generic solution that could work on any dataframe.

import org.apache.spark.sql.expressions.Window

val groupKey = List("c1", "c2").map(x => col(x.trim))
val orderByKey = List("offset").map(x => col(x.trim))

val w = Window.partitionBy(groupKey: _*).orderBy(orderByKey: _*)

val outputDF = inputDF
  .withColumn("cumulative_v1", sum(inputDF("v1")).over(w))
  .withColumn("cumulative_v2", sum(inputDF("v2")).over(w))

+---+---+---+---+------+-------------+-------------+
| c1| c2| v1| v2|offset|cumulative_v1|cumulative_v2|
+---+---+---+---+------+-------------+-------------+
|  1|  2| 30|100|    -1|           30|          100|
|  1|  2| 30|100|     0|           60|          200|
|  1|  2| 30|100|     1|           90|          300|
| 11| 21| 30|100|    -1|           30|          100|
| 11| 21| 30|100|     0|           60|          200|
| 11| 21| 30|100|     1|           90|          300|
+---+---+---+---+------+-------------+-------------+

The challenge is: [a] I need to do this across multiple and varying offset windows (-1 to 1), (-10 to 10), (-30 to 30) or any others; [b] I need to use this across multiple dataframes/datasets, so I'm hoping for a generic function that could work on either RDDs or Datasets.

Any thoughts on how I could achieve this in Spark 2.0?

Help is much appreciated. Thanks!

Here's a primitive take using just data frames.

import org.apache.spark.sql.expressions.Window

val groupKey = List("c1", "c2").map(x => col(x.trim))
val orderByKey = List("offset").map(x => col(x.trim))

val w = Window.partitionBy(groupKey: _*).orderBy(orderByKey: _*)

val inputDF= spark
  .sparkContext
  .parallelize(Seq((1,2,30, 100, -1),(1,2,3, 100, -2),(1,2,140, 100, 2),(1,2,30, 100, 0), (1,2,30, 100, 1),(11,21,30, 100, -1),(11,21,30, 100, 0), (11,21,30, 100, 1)), 10)
  .toDF("c1", "c2", "v1", "v2", "offset")

val outputDF = inputDF
  .withColumn("cumulative_v1", sum(when($"offset".between(-1, 1), inputDF("v1")).otherwise(0)).over(w))
  .withColumn("cumulative_v3", sum(when($"offset".between(-2, 2), inputDF("v1")).otherwise(0)).over(w))
  .withColumn("cumulative_v2", sum(inputDF("v2")).over(w))

This produces a cumulative sum over a single 'value' column for different offset windows.

scala> outputDF.show
+---+---+---+---+------+-------------+-------------+-------------+              
| c1| c2| v1| v2|offset|cumulative_v1|cumulative_v3|cumulative_v2|
+---+---+---+---+------+-------------+-------------+-------------+
|  1|  2|  3|100|    -2|            0|            0|          100|
|  1|  2| 30|100|    -1|           30|           30|          200|
|  1|  2| 30|100|     0|           60|           60|          300|
|  1|  2| 30|100|     1|           90|           90|          400|
|  1|  2|140|100|     2|           90|           90|          500|
| 11| 21| 30|100|    -1|           30|           30|          100|
| 11| 21| 30|100|     0|           60|           60|          200|
| 11| 21| 30|100|     1|           90|           90|          300|
+---+---+---+---+------+-------------+-------------+-------------+

A couple of drawbacks of this approach: [1] for each conditional window (-1, 1), (-2, 2) or any (from_offset, to_offset), sum() needs to be called separately; [2] this isn't a generic function.

I know Spark accepts a variable list of columns for aggregate functions, like this -

val exprs = Map("v1" -> "sum", "v2" -> "sum")
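
For reference, that column -> function map plugs into a plain grouped aggregation, roughly as sketched below (the grouped name is just for illustration); it yields per-group totals rather than the windowed, conditional running sums needed here.

// Grouped (non-windowed) aggregation driven by the map; sums v1 and v2 per (c1, c2) group
val grouped = inputDF.groupBy(groupKey: _*).agg(exprs)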

But I'm unsure how to extend this to window functions with variable conditions. I'm still very curious to know whether there is a better, modular/reusable function we could write to solve this.

Another generic way to solve this is with foldLeft, as explained here - https://stackoverflow.com/a/44532867/7059145
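
Along those lines, here is a minimal sketch of what such a reusable helper could look like (the cumulativeSums name, parameter list and column-naming scheme are assumptions for illustration, not taken from the linked answer): it folds withColumn over every (value column, offset window) pair and adds one conditional cumulative-sum column per pair.

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, sum, when}

// Hypothetical helper: one conditional cumulative-sum column per (value column, offset window) pair.
def cumulativeSums(df: DataFrame,
                   groupCols: Seq[String],
                   orderCol: String,
                   valueCols: Seq[String],
                   offsetWindows: Seq[(Int, Int)]): DataFrame = {
  val w = Window.partitionBy(groupCols.map(col): _*).orderBy(col(orderCol))
  val combos = for { v <- valueCols; (from, to) <- offsetWindows } yield (v, from, to)
  combos.foldLeft(df) { case (acc, (v, from, to)) =>
    // Only rows whose offset falls inside [from, to] contribute to the running sum.
    acc.withColumn(
      s"cumulative_${v}_${from}_to_${to}",
      sum(when(col(orderCol).between(from, to), col(v)).otherwise(0)).over(w))
  }
}

// e.g. cumulativeSums(inputDF, Seq("c1", "c2"), "offset", Seq("v1", "v2"), Seq((-1, 1), (-2, 2)))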
