Spark / Scala: split a row into several rows based on value changes within the row

Question

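Using Databricks, Spark 3.0.1.

To use the legacy format, I have set: spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

I have a dataframe similar to the sample below.

Each row needs to be split into several rows wherever the value changes between consecutive columns. The remaining columns of each new row can be filled with nulls.

Sample: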

+----+----+----+----+----+----+----+----+----+----+
|id  |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1   |100 |100 |100 |500 |500 |500 |200 |200 |200 |
|2   |100 |100 |700 |700 |700 |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+

Expected output:

+----+----+----+----+----+----+----+----+----+----+   
|id  |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1   |100 |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |
|2   |Nan |Nan |Nan |500 |500 |500 |Nan |Nan |Nan |
|3   |Nan |Nan |Nan |Nan |Nan |Nan |200 |200 |200 |
|4   |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |Nan |
|5   |Nan |Nan |700 |700 |700 |Nan |Nan |Nan |Nan |
|6   |Nan |Nan |Nan |Nan |Nan |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+

Answer 1

Given this dataframe:

+---+---+---+---+---+---+---+---+---+---+
| id| n1| n2| n3| n4| n5| n6| n7| n8| n9|
+---+---+---+---+---+---+---+---+---+---+
|  1|100|100|100|500|500|500|200|200|200|
|  2|100|100|700|700|700|100|100|100|100|
+---+---+---+---+---+---+---+---+---+---+

I came up with a solution based on a mix of the DataFrame and Dataset APIs:

import org.apache.spark.sql.functions.{array, col}
import spark.implicits._

val l = 9 // number of value columns
df
  // collect the value columns into a single array column
  .select($"id", array(df.columns.tail.map(col): _*).as("col"))
  // switch to the Dataset API
  .as[(Int, Seq[Int])]
  .flatMap { case (id, arr) =>
    val arrI = arr.zipWithIndex
    // split the list into sublists of equal adjacent values
    arrI.tail
      .foldLeft(Seq(Seq(arrI.head)))((acc, curr) =>
        if (acc.last.last._1 == curr._1) {
          acc.init :+ (acc.last :+ curr)
        } else {
          acc :+ Seq(curr)
        }
      )
      // reduce each sublist to (value, from, to)
      .map(chunk => (chunk.head._1, chunk.map(_._2).min, chunk.map(_._2).max))
      // generate the new rows, filling positions outside the run with None
      .zipWithIndex
      .map { case ((num, from, to), subI) =>
        (id, subI + 1, (0 until l).map(i => if (i >= from && i <= to) Some(num) else None))
      }
  }
  .toDF("id", "sub_id", "values") // back to the DataFrame API
  // expand the array into the columns n1..n9
  .select($"id" +: $"sub_id" +: (0 until l).map(i => $"values"(i).as(s"n${i + 1}")): _*)
  .show(false)

This yields:

+---+------+----+----+----+----+----+----+----+----+----+
|id |sub_id|n1  |n2  |n3  |n4  |n5  |n6  |n7  |n8  |n9  |
+---+------+----+----+----+----+----+----+----+----+----+
|1  |1     |100 |100 |100 |null|null|null|null|null|null|
|1  |2     |null|null|null|500 |500 |500 |null|null|null|
|1  |3     |null|null|null|null|null|null|200 |200 |200 |
|2  |1     |100 |100 |null|null|null|null|null|null|null|
|2  |2     |null|null|700 |700 |700 |null|null|null|null|
|2  |3     |null|null|null|null|null|100 |100 |100 |100 |
+---+------+----+----+----+----+----+----+----+----+----+
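To see what the fold inside the flatMap produces, the run-splitting step can be tried on a plain Scala Seq without Spark. This is only a sketch: splitAdjacent is a hypothetical name, and, like the code above, it assumes a non-empty input.

```scala
// Standalone version of the fold used in the flatMap above.
// `splitAdjacent` is a hypothetical helper name, not part of the answer.
def splitAdjacent(arr: Seq[Int]): Seq[(Int, Int, Int)] = {
  val arrI = arr.zipWithIndex
  arrI.tail
    // start a new sublist whenever the value differs from the previous one
    .foldLeft(Seq(Seq(arrI.head))) { (acc, curr) =>
      if (acc.last.last._1 == curr._1) acc.init :+ (acc.last :+ curr)
      else acc :+ Seq(curr)
    }
    // reduce each sublist to (value, first index, last index)
    .map(chunk => (chunk.head._1, chunk.head._2, chunk.last._2))
}

println(splitAdjacent(Seq(100, 100, 100, 500, 500, 500, 200, 200, 200)))
// prints List((100,0,2), (500,3,5), (200,6,8))
```

Each triple then becomes one output row: positions between the first and last index keep the value, all other positions become null.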

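As you can see, I have not yet managed to produce the correct id; that needs some more work. The problem is that creating a consecutive id requires a wide transformation (a window function without partitioning), which would become a performance bottleneck.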
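One option the answer does not explore is RDD.zipWithIndex, which numbers rows in partition order. It runs an extra job to count the rows per partition when there is more than one partition, but it does not pull every row into a single partition the way an unpartitioned window does. The sketch below is an assumption about how this could be wired up, not a tested part of the answer: withSequentialId is a hypothetical helper, and the resulting ids are only meaningful if the DataFrame's current row order is the order you want.

```scala
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

// Hypothetical helper: prepends a 1-based consecutive id column called `name`.
def withSequentialId(df: DataFrame, name: String): DataFrame = {
  // zipWithIndex assigns 0-based indices following partition order
  val rows = df.rdd.zipWithIndex.map { case (row, idx) =>
    Row.fromSeq((idx + 1) +: row.toSeq)
  }
  // the new id column comes first, followed by the original schema
  val schema = StructType(StructField(name, LongType, nullable = false) +: df.schema.fields)
  df.sparkSession.createDataFrame(rows, schema)
}
```

Applied to the result above, this would mean dropping id and sub_id from the DataFrame built before .show(false) and then calling withSequentialId(result, "id").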

Answer 2

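Some good old-fashioned coding, with a few Scala techniques worth learning along the way.

Anything involving Any is always ugly, so I used the Option approach. Note the toString() that is required and the particular way of building the List dynamically.

Is it functional programming? Less so than the other solution, but I was not sure how to handle folds and the like on nested structures. See this as an intermediate solution, then, that may be useful with mapPartitions.

Here is the code: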

import spark.implicits._
import org.apache.spark.sql.functions._
import scala.util.Try

def tryToInt( s: String ) = Try(s.toInt).toOption 
def lp( a: List[Int] ) :List[List[Option[Int]]] = {
      var cnt:Int = 0
      var tempAA: List[List[Int]] = List()
      var tempA: List[Int] = List()
      var c:Int = 0
      val sInput = a.size
      // Does not work for empty List, but will not be empty
      for (v <- a) {
        // a change in value closes the current run and starts a new one
        if (cnt > 0 && v != c) {
          tempAA = tempAA :+ tempA
          tempA = List()
        }
        c = v
        cnt += 1
        tempA = tempA :+ v
      }
      tempAA = tempAA :+ tempA

      val numItems = tempAA.map(x => x.size)          // List of occurrences per slot 
      val cumCount = numItems.scanLeft(0)(_ + _).tail // Cumulative count
      var res: List[List[Option[Int]]] = List()
      var tempAAA: List[List[String]] = List()
   
      for (i <- 0 until numItems.length) {
            val itemsLeft = cumCount(i) - numItems(i)
            val itemsRight = sInput - cumCount(i)
            val left = List.fill(itemsLeft)(None)
            val right = List.fill(itemsRight)(None)
        
            tempAAA = List()
            tempAAA = tempAAA :+ left.map(_.toString())
            tempAAA = tempAAA :+ tempAA(i).map(_.toString())
            tempAAA = tempAAA :+ right.map(_.toString())
            val tempAAAA = tempAAA.flatten.map(_.toString()).map(x => tryToInt(x))
            res = res :+ tempAAAA
      }
      return res
}

val dataIn = Seq((1,2,2,3,5,5,5,5,5), (4,2,2,2,5,5,5,5,5), (5,5,5,5,5,5,5,5,5)).toDS()
val data = dataIn.withColumn("input", array(dataIn.columns.map(col): _*)).select($"input").as[List[Int]]
val df = data.rdd.map(lp).toDF().select(explode($"value"))
val n = dataIn.columns.size
df.select( (0 until n).map(i => col("col")(i).alias(s"c${i+1}")): _*).show(false)

Returns:

+----+----+----+----+----+----+----+----+----+
|c1  |c2  |c3  |c4  |c5  |c6  |c7  |c8  |c9  |
+----+----+----+----+----+----+----+----+----+
|1   |null|null|null|null|null|null|null|null|
|null|2   |2   |null|null|null|null|null|null|
|null|null|null|3   |null|null|null|null|null|
|null|null|null|null|5   |5   |5   |5   |5   |
|4   |null|null|null|null|null|null|null|null|
|null|2   |2   |2   |null|null|null|null|null|
|null|null|null|null|5   |5   |5   |5   |5   |
|5   |5   |5   |5   |5   |5   |5   |5   |5   |
+----+----+----+----+----+----+----+----+----+
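For comparison, the same per-row logic can be written with a single foldLeft and without the toString() round trip. This is a sketch rather than the answer's own method: splitRuns is a hypothetical name, and it is intended as a drop-in alternative to lp in the data.rdd.map(lp) call.

```scala
// Hypothetical functional counterpart of lp; not part of the original answer.
def splitRuns(a: List[Int]): List[List[Option[Int]]] = {
  // Group equal adjacent values into (value, startIndex, length) triples.
  val runs = a.zipWithIndex.foldLeft(List.empty[(Int, Int, Int)]) {
    case ((v, start, len) :: rest, (x, _)) if v == x => (v, start, len + 1) :: rest
    case (acc, (x, i))                                => (x, i, 1) :: acc
  }.reverse
  // Emit one output row per run, with None outside the run's index range.
  runs.map { case (v, start, len) =>
    List.tabulate(a.size)(i => if (i >= start && i < start + len) Some(v) else None)
  }
}

splitRuns(List(4, 2, 2, 2, 5)).foreach(println)
// prints:
// List(Some(4), None, None, None, None)
// List(None, Some(2), Some(2), Some(2), None)
// List(None, None, None, None, Some(5))
```

The fold prepends to the accumulator and reverses once at the end, which avoids the repeated :+ appends on List, and an empty input simply yields an empty result.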

Solution 1
2 (accepted) 2020-11-18 21:05:28

Solution 2
1 2020-11-18 17:48:18
