Spark / Scala: Split row into several rows based on value change in current row
Using Databricks, Spark 3.0.1.
To use the legacy format I have set: spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")
I have a dataframe similar to the sample below.
Each row needs to be split into several rows based on changes in consecutive values. The remaining columns can be filled with null values.
Sample:
+----+----+----+----+----+----+----+----+----+----+
|id |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1 |100 |100 |100 |500 |500 |500 |200 |200 |200 |
|2 |100 |100 |700 |700 |700 |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+
Expected output:
+----+----+----+----+----+----+----+----+----+----+
|id |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1 |100 |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |
|2 |Nan |Nan |Nan |500 |500 |500 |Nan |Nan |Nan |
|3 |Nan |Nan |Nan |Nan |Nan |Nan |200 |200 |200 |
|4 |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |Nan |
|5 |Nan |Nan |700 |700 |700 |Nan |Nan |Nan |Nan |
|6 |Nan |Nan |Nan |Nan |Nan |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+
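To make the rule concrete: each run of equal adjacent values, occupying positions `from` to `to` of an `n`-column row, becomes its own output row, with nulls everywhere else. A minimal sketch of that padding rule in plain Scala (the helper name `padRun` is my own, purely for illustration):

```scala
// Hypothetical helper: place one run of equal values into an n-slot row,
// padding every other position with None (rendered as Nan/null above).
def padRun(value: Int, from: Int, to: Int, n: Int): Seq[Option[Int]] =
  (0 until n).map(i => if (i >= from && i <= to) Some(value) else None)

// padRun(500, 3, 5, 9) gives the middle output row of id 1:
// None, None, None, Some(500), Some(500), Some(500), None, None, None
```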
Given this dataframe:
+---+---+---+---+---+---+---+---+---+---+
| id| n1| n2| n3| n4| n5| n6| n7| n8| n9|
+---+---+---+---+---+---+---+---+---+---+
| 1|100|100|100|500|500|500|200|200|200|
| 2|100|100|700|700|700|100|100|100|100|
+---+---+---+---+---+---+---+---+---+---+
I came up with a solution based on a mix of the dataframe and dataset APIs:
val l = 9 // number of value columns

df
  // put the values into an array
  .select($"id", array(df.columns.tail.map(col): _*).as("col"))
  // switch to the dataset api
  .as[(Int, Seq[Int])]
  .flatMap { case (id, arr) =>
    val arrI = arr.zipWithIndex
    // split the list into sublists based on adjacent values
    arrI.tail
      .foldLeft(Seq(Seq(arrI.head)))((acc, curr) =>
        if (acc.last.last._1 == curr._1) {
          acc.init :+ (acc.last :+ curr)
        } else {
          acc :+ Seq(curr)
        }
      )
      // aggregate each sublist into (value, from, to)
      .map(chunk => (chunk.head._1, chunk.map(_._2).min, chunk.map(_._2).max))
      // generate the new lists, filled with Nones
      .zipWithIndex
      .map { case ((num, from, to), subI) =>
        (id, subI + 1, (0 until l).map(i => if (i >= from && i <= to) Some(num) else None))
      }
  }
  .toDF("id", "sub_id", "values") // back to the dataframe api
  // rename the columns
  .select($"id" +: $"sub_id" +: (0 until l).map(i => $"values"(i).as(s"n${i + 1}")): _*)
  .show(false)
This produces:
+---+------+----+----+----+----+----+----+----+----+----+
|id |sub_id|n1 |n2 |n3 |n4 |n5 |n6 |n7 |n8 |n9 |
+---+------+----+----+----+----+----+----+----+----+----+
|1 |1 |100 |100 |100 |null|null|null|null|null|null|
|1 |2 |null|null|null|500 |500 |500 |null|null|null|
|1 |3 |null|null|null|null|null|null|200 |200 |200 |
|2 |1 |100 |100 |null|null|null|null|null|null|null|
|2 |2 |null|null|700 |700 |700 |null|null|null|null|
|2 |3 |null|null|null|null|null|100 |100 |100 |100 |
+---+------+----+----+----+----+----+----+----+----+----+
As you can see, I have not yet managed to produce the right `id`; that would need more work. The problem is creating a sequential id, which would require a wide transformation (a window function without partitioning), and that would cause a performance bottleneck.
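The grouping step at the heart of the `flatMap` can be tried out in plain Scala on its own: it folds the indexed values into runs of equal neighbours and reduces each run to a `(value, from, to)` triple, one per output row:

```scala
// Standalone version of the foldLeft used inside the flatMap above,
// applied to row 1 of the sample data.
val arr  = Seq(100, 100, 100, 500, 500, 500, 200, 200, 200)
val arrI = arr.zipWithIndex

// start a new sublist whenever the value changes from its left neighbour
val chunks = arrI.tail.foldLeft(Seq(Seq(arrI.head))) { (acc, curr) =>
  if (acc.last.last._1 == curr._1) acc.init :+ (acc.last :+ curr)
  else acc :+ Seq(curr)
}

// collapse each run into (value, first index, last index)
val ranges = chunks.map(c => (c.head._1, c.map(_._2).min, c.map(_._2).max))
// ranges == Seq((100,0,2), (500,3,5), (200,6,8)): one output row per triple
```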
Some good old-fashioned coding, and a Scala learning exercise as well. The `Any` aspect is always ugly, so I used the `Option` approach. Note the `toString()` calls that are needed, and the particular way the `List` is built up dynamically.
Is it functional programming? Less so than the other solution, but I was not sure how to handle folds over nested structures and the like. So this is an intermediate solution; it could be useful with `mapPartitions`.
Here is the code:
import spark.implicits._
import org.apache.spark.sql.functions._
import scala.util.Try

def tryToInt(s: String): Option[Int] = Try(s.toInt).toOption

def lp(a: List[Int]): List[List[Option[Int]]] = {
  var cnt: Int = 0
  var tempAA: List[List[Int]] = List()
  var tempA: List[Int] = List()
  var c: Int = 0
  val sInput = a.size
  // Does not work for an empty List, but the input will not be empty
  for (v <- a) {
    if (cnt > 0 && v != c) {
      tempAA = tempAA :+ tempA
      tempA = List()
    }
    c = v
    cnt += 1
    tempA = tempA :+ v
  }
  tempAA = tempAA :+ tempA

  val numItems = tempAA.map(x => x.size)          // number of occurrences per run
  val cumCount = numItems.scanLeft(0)(_ + _).tail // cumulative count

  var res: List[List[Option[Int]]] = List()
  var tempAAA: List[List[String]] = List()
  for (i <- 0 until numItems.length) {
    val itemsLeft  = cumCount(i) - numItems(i)
    val itemsRight = sInput - cumCount(i)
    val left  = List.fill(itemsLeft)(None)
    val right = List.fill(itemsRight)(None)
    tempAAA = List()
    tempAAA = tempAAA :+ left.map(_.toString())
    tempAAA = tempAAA :+ tempAA(i).map(_.toString())
    tempAAA = tempAAA :+ right.map(_.toString())
    val tempAAAA = tempAAA.flatten.map(x => tryToInt(x))
    res = res :+ tempAAAA
  }
  res
}

val dataIn = Seq((1,2,2,3,5,5,5,5,5), (4,2,2,2,5,5,5,5,5), (5,5,5,5,5,5,5,5,5)).toDS()
val data   = dataIn.withColumn("input", array(dataIn.columns.map(col): _*)).select($"input").as[List[Int]]
val df     = data.rdd.map(lp).toDF().select(explode($"value"))
val n      = dataIn.columns.size

df.select((0 until n).map(i => col("col")(i).alias(s"c${i + 1}")): _*).show(false)
This returns:
+----+----+----+----+----+----+----+----+----+
|c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8 |c9 |
+----+----+----+----+----+----+----+----+----+
|1 |null|null|null|null|null|null|null|null|
|null|2 |2 |null|null|null|null|null|null|
|null|null|null|3 |null|null|null|null|null|
|null|null|null|null|5 |5 |5 |5 |5 |
|4 |null|null|null|null|null|null|null|null|
|null|2 |2 |2 |null|null|null|null|null|
|null|null|null|null|5 |5 |5 |5 |5 |
|5 |5 |5 |5 |5 |5 |5 |5 |5 |
+----+----+----+----+----+----+----+----+----+
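On the "not sure how to fold over nested structures" point: the same grouping and padding can also be expressed as one fold plus a scan, avoiding both the mutable state and the `toString`/`tryToInt` round trip. A hedged sketch only; the name `lpFunctional` is my own, and it is meant to mirror `lp`, not replace it:

```scala
// Functional alternative to lp: group adjacent equal values with a fold,
// compute each run's starting offset with a scan, then pad with Nones.
def lpFunctional(a: List[Int]): List[List[Option[Int]]] = {
  // group adjacent equal values into runs
  val runs = a.foldRight(List.empty[List[Int]]) {
    case (v, (run @ (x :: _)) :: rest) if x == v => (v :: run) :: rest
    case (v, acc)                                => List(v) :: acc
  }
  // starting offset of each run within the row
  val starts = runs.scanLeft(0)(_ + _.size)
  runs.zip(starts).map { case (run, from) =>
    List.fill(from)(Option.empty[Int]) ++
      run.map(Some(_)) ++
      List.fill(a.size - from - run.size)(Option.empty[Int])
  }
}
```

On the sample input `List(1, 2, 2, 3, 5, 5, 5, 5, 5)` this yields the same four padded rows that `lp` produces for the first row of `dataIn`.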
Note: the technical posts on this site follow the CC BY-SA 4.0 license. If you need to repost, please cite this site's URL or the original source. For any questions, contact: yoyou2525@163.com.