Spark / Scala: Split row into several rows based on value change in current row
Using Databricks, Spark 3.0.1.
To use the legacy format, I set: spark.conf.set("spark.sql.legacy.timeParserPolicy", "LEGACY")
I have a dataframe similar to the sample below.
Each row needs to be split into several rows, based on where the value changes between consecutive columns. The other columns can be filled with empty values.
Sample:
+----+----+----+----+----+----+----+----+----+----+
|id |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1 |100 |100 |100 |500 |500 |500 |200 |200 |200 |
|2 |100 |100 |700 |700 |700 |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+
Expected output:
+----+----+----+----+----+----+----+----+----+----+
|id |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1 |100 |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |
|2 |Nan |Nan |Nan |500 |500 |500 |Nan |Nan |Nan |
|3 |Nan |Nan |Nan |Nan |Nan |Nan |200 |200 |200 |
|4 |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |Nan |
|5 |Nan |Nan |700 |700 |700 |Nan |Nan |Nan |Nan |
|6 |Nan |Nan |Nan |Nan |Nan |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+
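Before involving Spark at all, the splitting rule can be sketched on a plain Scala list. This is a minimal sketch of my own (not part of the original question); `splitRuns` is a hypothetical helper name, and `Option[Int]` stands in for the empty/`Nan` cells:

```scala
// Split a row into one padded row per run of adjacent equal values.
def splitRuns(row: List[Int]): List[List[Option[Int]]] = {
  val indexed = row.zipWithIndex
  // group adjacent equal values into runs (built in reverse, then flipped)
  val runs = indexed.tail
    .foldLeft(List(List(indexed.head))) { (acc, cur) =>
      if (acc.head.head._1 == cur._1) (cur :: acc.head) :: acc.tail
      else List(cur) :: acc
    }
    .reverse.map(_.reverse)
  // expand each run to a full-width row, padded with None elsewhere
  runs.map { run =>
    val idxs = run.map(_._2).toSet
    row.indices.map(i => if (idxs(i)) Some(row(i)) else None).toList
  }
}
```

Applied to the first sample row, `splitRuns(List(100, 100, 100, 500, 500, 500, 200, 200, 200))` yields three rows, each keeping one run and `None` everywhere else.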
Given this dataframe:
+---+---+---+---+---+---+---+---+---+---+
| id| n1| n2| n3| n4| n5| n6| n7| n8| n9|
+---+---+---+---+---+---+---+---+---+---+
| 1|100|100|100|500|500|500|200|200|200|
| 2|100|100|700|700|700|100|100|100|100|
+---+---+---+---+---+---+---+---+---+---+
I came up with a solution based on a mix of the DataFrame and Dataset APIs:
val l = 9 // number of cols
df
  // put the values into an array column
  .select($"id", array(df.columns.tail.map(col): _*).as("col"))
  // switch to the Dataset API
  .as[(Int, Seq[Int])]
  .flatMap { case (id, arr) =>
    val arrI = arr.zipWithIndex
    // split the list into sublists of adjacent equal values
    arrI.tail
      .foldLeft(Seq(Seq(arrI.head))) { (acc, curr) =>
        if (acc.last.last._1 == curr._1) acc.init :+ (acc.last :+ curr)
        else acc :+ Seq(curr)
      }
      // aggregate each sublist into (value, from, to)
      .map(chunk => (chunk.head._1, chunk.map(_._2).min, chunk.map(_._2).max))
      // generate the new rows, filled with Nones outside [from, to]
      .zipWithIndex
      .map { case ((num, from, to), subI) =>
        (id, subI + 1, (0 until l).map(i => if (i >= from && i <= to) Some(num) else None))
      }
  }
  .toDF("id", "sub_id", "values") // back to the DataFrame API
  // unpack the array into named columns
  .select($"id" +: $"sub_id" +: (0 until l).map(i => $"values"(i).as(s"n${i + 1}")): _*)
  .show(false)
This produces:
+---+------+----+----+----+----+----+----+----+----+----+
|id |sub_id|n1 |n2 |n3 |n4 |n5 |n6 |n7 |n8 |n9 |
+---+------+----+----+----+----+----+----+----+----+----+
|1 |1 |100 |100 |100 |null|null|null|null|null|null|
|1 |2 |null|null|null|500 |500 |500 |null|null|null|
|1 |3 |null|null|null|null|null|null|200 |200 |200 |
|2 |1 |100 |100 |null|null|null|null|null|null|null|
|2 |2 |null|null|700 |700 |700 |null|null|null|null|
|2 |3 |null|null|null|null|null|100 |100 |100 |100 |
+---+------+----+----+----+----+----+----+----+----+----+
As you can see, I have not yet managed to produce the correct id; that still needs some work. The problem is that creating a consecutive id requires a wide transformation (a window function without partitioning), which would create a performance bottleneck.
Some good old-fashioned coding, and a Scala project to learn from.
Working with Any is always ugly, so I used the Option approach instead. Note the toString() calls that are needed, and the particular way the List is built up dynamically.
Is it functional programming? Less so than the other solution, but I was not sure how to handle things like folding over nested structures. Then again, this intermediate solution might be useful with mapPartitions.
Here is the code:
import spark.implicits._
import org.apache.spark.sql.functions._
import scala.util.Try

def tryToInt(s: String) = Try(s.toInt).toOption

def lp(a: List[Int]): List[List[Option[Int]]] = {
  var cnt: Int = 0
  var tempAA: List[List[Int]] = List()
  var tempA: List[Int] = List()
  var c: Int = 0
  val sInput = a.size
  // Does not work for an empty List, but the input will not be empty
  for (v <- a) {
    if (cnt > 0 && v != c) {
      tempAA = tempAA :+ tempA
      tempA = List()
    }
    c = v
    cnt += 1
    tempA = tempA :+ v
  }
  tempAA = tempAA :+ tempA

  val numItems = tempAA.map(x => x.size)          // occurrences per run
  val cumCount = numItems.scanLeft(0)(_ + _).tail // cumulative count
  var res: List[List[Option[Int]]] = List()
  var tempAAA: List[List[String]] = List()
  for (i <- 0 until numItems.length) {
    val itemsLeft = cumCount(i) - numItems(i)
    val itemsRight = sInput - cumCount(i)
    val left = List.fill(itemsLeft)(None)
    val right = List.fill(itemsRight)(None)
    tempAAA = List()
    tempAAA = tempAAA :+ left.map(_.toString())
    tempAAA = tempAAA :+ tempAA(i).map(_.toString())
    tempAAA = tempAAA :+ right.map(_.toString())
    val tempAAAA = tempAAA.flatten.map(x => tryToInt(x))
    res = res :+ tempAAAA
  }
  res
}

val dataIn = Seq((1,2,2,3,5,5,5,5,5), (4,2,2,2,5,5,5,5,5), (5,5,5,5,5,5,5,5,5)).toDS()
val data = dataIn.withColumn("input", array(dataIn.columns.map(col): _*)).select($"input").as[List[Int]]
val df = data.rdd.map(lp).toDF().select(explode($"value"))
val n = dataIn.columns.size
df.select((0 until n).map(i => col("col")(i).alias(s"c${i + 1}")): _*).show(false)
This returns:
+----+----+----+----+----+----+----+----+----+
|c1 |c2 |c3 |c4 |c5 |c6 |c7 |c8 |c9 |
+----+----+----+----+----+----+----+----+----+
|1 |null|null|null|null|null|null|null|null|
|null|2 |2 |null|null|null|null|null|null|
|null|null|null|3 |null|null|null|null|null|
|null|null|null|null|5 |5 |5 |5 |5 |
|4 |null|null|null|null|null|null|null|null|
|null|2 |2 |2 |null|null|null|null|null|
|null|null|null|null|5 |5 |5 |5 |5 |
|5 |5 |5 |5 |5 |5 |5 |5 |5 |
+----+----+----+----+----+----+----+----+----+
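The padding arithmetic inside lp — the scanLeft cumulative counts that decide how many None cells go to the left and right of each run — can be checked in isolation. A minimal sketch of my own, using the run lengths of the input row (1, 2, 2, 3, 5, 5, 5, 5, 5):

```scala
// That row has four runs of sizes 1, 2, 1, 5 in a 9-wide row.
val numItems = List(1, 2, 1, 5)                 // occurrences per run
val total = numItems.sum                        // row width, 9
val cumCount = numItems.scanLeft(0)(_ + _).tail // running totals: 1, 3, 4, 9
val pads = numItems.indices.map { i =>
  val left  = cumCount(i) - numItems(i)         // None cells before the run
  val right = total - cumCount(i)               // None cells after the run
  (left, right)
}
// pads: (0,8), (1,6), (3,5), (4,0)
```

These (left, right) pairs match the first three output rows above: the lone 1 is padded (0, 8), the pair of 2s (1, 6), the lone 3 (3, 5), and the five 5s (4, 0).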