
Apache Spark - how to create difference columns for every column in dataframe?

I have a Spark DataFrame with an ID column and a number of numeric columns, and for every column besides ID, I'm trying to generate a column of lagged differences grouped by ID.

For instance, if I have this DataFrame:

+---+-----+-----+-----+
| ID| var1| var2| var3|
+---+-----+-----+-----+
|  1|    1|    3|    2|
|  1|    2|    4|    2|
|  1|    3|    1|    3|
|  2|    1|    3|    4|
|  2|    1|    2|    1|
|  2|    1|    1|    1|
|  2|    3|    3|    1|
|  3|   -1|    0|    0|
|  3|    2|   -1|    2|
|  3|    0|    4|    0|
+---+-----+-----+-----+

I would expect the output to be something like this:

+---+-----+-----+-----+----------+----------+----------+
| ID| var1| var2| var3| var1_diff| var2_diff| var3_diff|
+---+-----+-----+-----+----------+----------+----------+
|  1|    1|    3|    2|      null|      null|      null|
|  1|    2|    4|    2|         1|         1|         0|
|  1|    3|    1|    3|         1|        -3|         1|
|  2|    1|    3|    4|      null|      null|      null|
|  2|    1|    2|    1|         0|        -1|        -3|
|  2|    1|    1|    1|         0|        -1|         0|
|  2|    3|    3|    1|         2|         2|         0|
|  3|   -1|    0|    0|      null|      null|      null|
|  3|    2|   -1|    2|         3|        -1|         2|
|  3|    0|    4|    0|        -2|         5|        -2|
+---+-----+-----+-----+----------+----------+----------+

Where the _diff columns are the original columns minus their lagged values. My DataFrame has many more than 3 variables, so I'd want to be able to generate lagged differences for an arbitrarily large number of columns; i.e. I don't want to create the _diff columns one by one.
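The per-group rule is simple: each value minus the previous value within the same ID, with the first row of each group undefined. A minimal plain-Scala sketch of that rule (lagDiff is an illustrative helper, not part of any Spark solution):

```scala
// lagged difference over one group's values: the first row has no
// predecessor, so it maps to None (null in the DataFrame output)
def lagDiff(xs: Seq[Int]): Seq[Option[Int]] =
  if (xs.isEmpty) Seq.empty
  else None +: xs.sliding(2).collect { case Seq(a, b) => Some(b - a) }.toSeq

println(lagDiff(Seq(1, 2, 3)))  // List(None, Some(1), Some(1))
```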

Any ideas as to how I can achieve this?

I would suggest going with foldLeft (a powerful Scala API).

// assuming that the column ID is at the front
val tailColumns = df.columns.tail

import org.apache.spark.sql.expressions._
import org.apache.spark.sql.functions.{col, lag}

val windowSpec = Window.partitionBy("ID").orderBy("ID")

tailColumns.foldLeft(df) { (tempdf, colName) =>
  tempdf.withColumn(colName + "_diff", col(colName) - lag(col(colName), 1).over(windowSpec))
}.show(false)
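For readers new to foldLeft: it threads an accumulator (here the DataFrame) through each element of the collection, and each step returns an updated copy, just as withColumn returns a new DataFrame. A plain-collections sketch of the same pattern:

```scala
// the accumulator starts as the base value (df above, a Map here) and
// each iteration returns a new accumulator with one more derived entry
val cols = Seq("var1", "var2", "var3")
val derived = cols.foldLeft(Map.empty[String, String]) { (acc, c) =>
  acc + (s"${c}_diff" -> s"$c - lag($c)")
}
println(derived("var2_diff"))  // var2 - lag(var2)
```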

which should give you:

+---+----+----+----+---------+---------+---------+
|ID |var1|var2|var3|var1_diff|var2_diff|var3_diff|
+---+----+----+----+---------+---------+---------+
|1  |1   |3   |2   |null     |null     |null     |
|1  |2   |4   |2   |1        |1        |0        |
|1  |3   |1   |3   |1        |-3       |1        |
|3  |-1  |0   |0   |null     |null     |null     |
|3  |2   |-1  |2   |3        |-1       |2        |
|3  |0   |4   |0   |-2       |5        |-2       |
|2  |1   |3   |4   |null     |null     |null     |
|2  |1   |2   |1   |0        |-1       |-3       |
|2  |1   |1   |1   |0        |-1       |0        |
|2  |3   |3   |1   |2        |2        |0        |
+---+----+----+----+---------+---------+---------+

Note: I have used ID in orderBy, which is not recommended; it is better to generate a separate column that preserves the order of the rows and order by that instead of ID.
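One way to follow that recommendation is a sketch like the following, assuming the input rows arrive in the desired order (the row_order column name is illustrative):

```scala
import org.apache.spark.sql.functions.monotonically_increasing_id
import org.apache.spark.sql.expressions.Window

// monotonically_increasing_id produces ids that increase with row
// position (increasing but not consecutive), so ordering the window
// by it preserves the original row order inside each ID partition
val dfWithOrder = df.withColumn("row_order", monotonically_increasing_id())
val windowSpec = Window.partitionBy("ID").orderBy("row_order")
```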

I hope the answer is helpful.

You will need to use lag, as you pointed out, together with a Spark Window function.

You can generate a dynamic expression by storing the columns that you need to find the difference for.

The following basically creates an expression of type org.apache.spark.sql.Column which you can use over your original DataFrame.

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}
import spark.implicits._

val w = Window.partitionBy($"id").orderBy($"id")

// df.columns returns all the columns of the dataframe
// union is used to include the original columns in the expression
// expr looks like: (var1 - lag(var1) over window) as var1_diff ...
val expr = df.columns.map(col(_)) union
  df.columns.filterNot(_.toLowerCase.equals("id")).map { x =>
    (col(x) - lag(col(x), 1).over(w)).as(s"${x}_diff")
  }
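The name-level effect of that union can be seen with plain Scala, using strings in place of Column objects:

```scala
// the select list is the original columns followed by one "_diff"
// name per non-ID column, mirroring the Column expressions above
val columns = Seq("id", "var1", "var2", "var3")
val exprNames = columns union
  columns.filterNot(_.toLowerCase == "id").map(c => s"${c}_diff")
println(exprNames.mkString(", "))
// id, var1, var2, var3, var1_diff, var2_diff, var3_diff
```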

And then you can execute the expression generated above over your DataFrame using select:

df.select(expr:_*).show
+---+----+----+----+---------+---------+---------+
| id|var1|var2|var3|var1_diff|var2_diff|var3_diff|
+---+----+----+----+---------+---------+---------+
|  1|   1|   3|   2|     null|     null|     null|
|  1|   2|   4|   2|        1|        1|        0|
|  1|   3|   1|   3|        1|       -3|        1|
|  3|  -1|   0|   0|     null|     null|     null|
|  3|   2|  -1|   2|        3|       -1|        2|
|  3|   0|   4|   0|       -2|        5|       -2|
|  2|   1|   3|   4|     null|     null|     null|
|  2|   1|   2|   1|        0|       -1|       -3|
|  2|   1|   1|   1|        0|       -1|        0|
|  2|   3|   3|   1|        2|        2|        0|
+---+----+----+----+---------+---------+---------+
