Find difference of column value in Spark using Scala
I have a dataframe with n columns, as below.
+---+------------+--------+--------+--------+
|id | date|signal01|signal02|signal03|......signal(n)
+---+------------+--------+--------+--------+
|050|2021-01-14 |1 |3 |1 |
|050|2021-01-15 |null |4 |2 |
|050|2021-02-02 |2 |5 |3 |
|051|2021-01-14 |1 |3 |0 |
|051|2021-01-15 |null |null |null |
|051|2021-02-02 |3 |3 |2 |
|051|2021-02-03 |4 |3 |3 |
|052|2021-03-03 |1 |3 |0 |
|052|2021-03-05 |null |3 |null |
|052|2021-03-06 |null |null |2 |
|052|2021-03-16 |3 |5 |5 |.......value(n)
+---+------------+--------+--------+--------+
For each signal I have to add a signal-difference column, as below, excluding null records and keeping the first difference value as 0.
+---+------------+--------+-------------+--------+-------------+--------+-------------+
|id | date|signal01|signal01_diff|signal02|signal02_diff|signal03|signal03_diff|......signal(n)
+---+------------+--------+-------------+--------+-------------+--------+-------------+
|050|2021-01-14 |1 |0 |3 |0 |1 |0 |
|050|2021-01-15 |null |null |4 |1 |2 |1 |
|050|2021-02-02 |2 |1 |5 |1 |3 |1 |
|051|2021-01-14 |1 |0 |3 |0 |0 |0 |
|051|2021-01-15 |null |null |null |null |null |null |
|051|2021-02-02 |3 |2 |3 |0 |2 |2 |
|051|2021-02-03 |4 |1 |3 |0 |3 |1 |
|052|2021-03-03 |1 |0 |3 |0 |0 |0 |
|052|2021-03-05 |null |null |3 |0 |null |null |
|052|2021-03-06 |null |null |null |null |2 |2 |
|052|2021-03-16 |3 |2 |5 |2 |5 |3 |.......value(n)
+---+------------+--------+-------------+--------+-------------+--------+-------------+
I have tried lag and window functions, but did not get the expected output because of the null values.
val w = org.apache.spark.sql.expressions.Window.orderBy("id")
val dfWithLag = df.withColumn("signal01_lag", lag("signal01", 1, 0).over(w))
The above is the code for a single column; I would have to repeat the same for the remaining n columns.
Is there an optimal way to achieve this?
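To see why a plain previous-row lag cannot produce the expected output, here is a minimal plain-Scala sketch of that logic (lagDiffs is a hypothetical helper for illustration, not Spark API):

```scala
// lag-style diff: subtract the immediately preceding row's value.
// Whenever either the current or the previous value is null (None),
// the diff is null too - so a single null poisons the next row's diff,
// instead of falling back to the last non-null value as required.
def lagDiffs(xs: Seq[Option[Int]]): Seq[Option[Int]] =
  (Option.empty[Int] +: xs.init).zip(xs).map { case (prev, cur) =>
    for (p <- prev; c <- cur) yield c - p
  }
```

With an input like Seq(Some(1), None, Some(2)), every diff comes out null, whereas the requirement is 0, null, 1.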
That's a good sample dataset for illustrating the requirement. Based on the expected output, there are a couple of issues in your code: the Window spec w uses orderBy("id"), but it should be partitioned by "id" and ordered by "date"; and lag won't handle nulls between consecutive rows. The approach shown below leverages the Window function last over rowsBetween() to keep track of the last non-null signal value when computing the wanted row-wise signal differences:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

val df = Seq(
  ("050", "2021-01-14", Some(1), Some(3), Some(1)),
  ("050", "2021-01-15", None, Some(4), Some(2)),
  ("050", "2021-02-02", Some(2), Some(5), Some(3)),
  ("051", "2021-01-14", Some(1), Some(3), Some(0)),
  ("051", "2021-01-15", None, None, None),
  ("051", "2021-02-02", Some(3), Some(3), Some(2)),
  ("051", "2021-02-03", Some(4), Some(3), Some(3)),
  ("052", "2021-03-03", Some(1), Some(3), Some(0)),
  ("052", "2021-03-05", None, Some(3), None),
  ("052", "2021-03-06", None, None, Some(2)),
  ("052", "2021-03-16", Some(3), Some(5), Some(5))
).toDF("id", "date", "signal01", "signal02", "signal03")

// partition per id, order by date, and look only at the preceding rows
val w = Window.partitionBy("id").orderBy("date").
  rowsBetween(Window.unboundedPreceding, -1)
// pick out the signal columns by name, keeping the rest as-is
val signals = df.columns.filter(_ matches "signal\\d+")
val signalCols = signals.map(col)
val otherCols = df.columns.map(col) diff signalCols

df.select(
    otherCols ++
    signalCols ++
    signals.map(s =>
      // diff against the last non-null value seen so far; coalesce falls
      // back to the current value itself, making the first diff 0
      (col(s) - coalesce(last(col(s), ignoreNulls=true).over(w), col(s))).as(s"${s}_diff")
    ): _*
  ).
  orderBy("id", "date").  // only for ordered display
  show
/*
+---+----------+--------+--------+--------+-------------+-------------+-------------+
| id| date|signal01|signal02|signal03|signal01_diff|signal02_diff|signal03_diff|
+---+----------+--------+--------+--------+-------------+-------------+-------------+
|050|2021-01-14| 1| 3| 1| 0| 0| 0|
|050|2021-01-15| null| 4| 2| null| 1| 1|
|050|2021-02-02| 2| 5| 3| 1| 1| 1|
|051|2021-01-14| 1| 3| 0| 0| 0| 0|
|051|2021-01-15| null| null| null| null| null| null|
|051|2021-02-02| 3| 3| 2| 2| 0| 2|
|051|2021-02-03| 4| 3| 3| 1| 0| 1|
|052|2021-03-03| 1| 3| 0| 0| 0| 0|
|052|2021-03-05| null| 3| null| null| 0| null|
|052|2021-03-06| null| null| 2| null| null| 2|
|052|2021-03-16| 3| 5| 5| 2| 2| 3|
+---+----------+--------+--------+--------+-------------+-------------+-------------+
*/
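The per-row semantics of that window expression can be sanity-checked against the expected table with a plain-Scala model (signalDiffs is an illustrative helper, not part of the Spark solution): subtract the last non-null value seen so far, leave nulls as nulls, and let the first non-null value in a group diff against itself.

```scala
// Plain-Scala model of "value minus last non-null value over the window":
// a None stays None, and the first non-null value in a sequence gets a
// diff of 0 because the fallback (coalesce in the Spark version) is the
// value itself.
def signalDiffs(xs: Seq[Option[Int]]): Seq[Option[Int]] = {
  var lastSeen: Option[Int] = None
  xs.map { x =>
    val diff = x.map(v => v - lastSeen.getOrElse(v))
    if (x.isDefined) lastSeen = x
    diff
  }
}
```

For example, signal01 of id 051 is Seq(Some(1), None, Some(3), Some(4)), which yields Seq(Some(0), None, Some(2), Some(1)), matching the signal01_diff column above.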
Alternatively, you can use foldLeft to traverse the list of columns and create the required new columns.
val cols = df.columns.toSeq
val newDf = cols.foldLeft(df)((acc, c) =>
  // note: s"${c}_lag" needs braces; s"$col_lag" would look for a
  // variable named col_lag and fail to compile
  acc.withColumn(s"${c}_lag", lag(c, 1, 0).over(w))
)
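The foldLeft pattern itself is Spark-independent: each step returns a new accumulator with one extra derived entry, just as each withColumn call returns a new DataFrame with one extra column. A plain-Scala sketch (the names and the Map accumulator are illustrative only):

```scala
// Thread an accumulator through the column list, adding one derived
// "<name>_lag" entry per step - the same shape as the withColumn fold.
val columns = Seq("signal01", "signal02", "signal03")
val derived = columns.foldLeft(Map.empty[String, String])((acc, c) =>
  acc + (s"${c}_lag" -> s"lag($c)")
)
```

Since every step feeds its result into the next, the fold builds all n columns in a single pass without repeating the per-column boilerplate by hand.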
Note: the posts on this site are licensed under CC BY-SA 4.0; if you repost, please credit the original source.