Spark / Scala: Split row into several rows based on value change in current row

Using Databricks, Spark 3.0.1.

To use the legacy format, I have set: spark.conf.set("spark.sql.legacy.timeParserPolicy","LEGACY")

I have a dataframe similar to the sample below.

Each row needs to be split into several rows based on changes in consecutive values. The other columns can be filled with null.

Sample:

+----+----+----+----+----+----+----+----+----+----+
|id  |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1   |100 |100 |100 |500 |500 |500 |200 |200 |200 |
|2   |100 |100 |700 |700 |700 |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+

Expected Output:

+----+----+----+----+----+----+----+----+----+----+   
|id  |t.n1|t.n2|t.n3|t.n4|t.n5|t.n6|t.n7|t.n8|t.n9|
+----+----+----+----+----+----+----+----+----+----+
|1   |100 |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |
|2   |Nan |Nan |Nan |500 |500 |500 |Nan |Nan |Nan |
|3   |Nan |Nan |Nan |Nan |Nan |Nan |200 |200 |200 |
|4   |100 |100 |Nan |Nan |Nan |Nan |Nan |Nan |Nan |
|5   |Nan |Nan |700 |700 |700 |Nan |Nan |Nan |Nan |
|6   |Nan |Nan |Nan |Nan |Nan |100 |100 |100 |100 |
+----+----+----+----+----+----+----+----+----+----+
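
To make the requirement concrete: the core operation is grouping each row's values into runs of consecutive equal elements; each run then becomes one output row, with null everywhere outside the run's positions. A minimal pure-Scala sketch of that grouping (the runs helper below is just an illustration, not part of the original question):

// Group a row's values into runs of consecutive equal elements.
// Seq(100,100,100,500,500,500,200,200,200) becomes
// Seq(Seq(100,100,100), Seq(500,500,500), Seq(200,200,200)).
def runs(values: Seq[Int]): Seq[Seq[Int]] =
  values.foldLeft(Seq.empty[Seq[Int]]) { (acc, v) =>
    acc.lastOption match {
      case Some(last) if last.head == v => acc.init :+ (last :+ v)
      case _                            => acc :+ Seq(v)
    }
  }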

Given this dataframe:

+---+---+---+---+---+---+---+---+---+---+
| id| n1| n2| n3| n4| n5| n6| n7| n8| n9|
+---+---+---+---+---+---+---+---+---+---+
|  1|100|100|100|500|500|500|200|200|200|
|  2|100|100|700|700|700|100|100|100|100|
+---+---+---+---+---+---+---+---+---+---+
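
The original does not show how this dataframe is built; a minimal, assumed construction for reproducing it could be:

import spark.implicits._

val df = Seq(
  (1, 100, 100, 100, 500, 500, 500, 200, 200, 200),
  (2, 100, 100, 700, 700, 700, 100, 100, 100, 100)
).toDF("id", "n1", "n2", "n3", "n4", "n5", "n6", "n7", "n8", "n9")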

I came up with a solution based on a mixture of dataframes and datasets:

val l = 9 // number of cols
df
  // put values into array
  .select($"id", array(df.columns.tail.map(col): _*).as("col"))
  // switch to dataset api
  .as[(Int, Seq[Int])]
  .flatMap { case (id, arr) => {
    val arrI = arr.zipWithIndex
    // split list in sublist based on adjacent values
    arrI.tail
      .foldLeft(Seq(Seq(arrI.head)))((acc, curr) =>
        if (acc.last.last._1 == curr._1) {
          acc.init :+ (acc.last :+ curr)
        } else {
          acc :+ Seq(curr)
        }
      )
     // aggregate sublists into value, from, to
      .map(chunk => (chunk.head._1, chunk.map(_._2).min, chunk.map(_._2).max))
      // generate new lists, fill with Nones
      .zipWithIndex
      .map { case ((num, from, to),subI) => (id,subI+1,(0 until l).map(i=> if(i>=from && i<=to) Some(num) else None))}
  }
  }
  .toDF("id","sub_id","values") // back to dataframe api
  // rename columns
  .select($"id"+:$"sub_id"+:(0 until l).map(i => $"values"(i).as(s"n${i+1}")):_*)
  .show(false)

which yields:

+---+------+----+----+----+----+----+----+----+----+----+
|id |sub_id|n1  |n2  |n3  |n4  |n5  |n6  |n7  |n8  |n9  |
+---+------+----+----+----+----+----+----+----+----+----+
|1  |1     |100 |100 |100 |null|null|null|null|null|null|
|1  |2     |null|null|null|500 |500 |500 |null|null|null|
|1  |3     |null|null|null|null|null|null|200 |200 |200 |
|2  |1     |100 |100 |null|null|null|null|null|null|null|
|2  |2     |null|null|700 |700 |700 |null|null|null|null|
|2  |3     |null|null|null|null|null|100 |100 |100 |100 |
+---+------+----+----+----+----+----+----+----+----+----+

As you can see, I have not yet managed to produce the correct id; this would need some more work. The problem is generating a sequential id, which would require a wide transformation (a window function without partitioning) and would therefore create a performance bottleneck.
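
For illustration (this is not part of the original answer), the two usual options for such an id look roughly as follows: row_number over an unpartitioned window yields consecutive numbers but pulls every row into a single partition, which is exactly the bottleneck described above, while monotonically_increasing_id() stays distributed but only guarantees unique, increasing values, not consecutive ones. Here resultDf stands for the dataframe produced by the pipeline above (i.e. before the final .show(false)):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, monotonically_increasing_id}

// Consecutive ids, but the unpartitioned window forces all rows into one partition:
val withSeqId = resultDf.withColumn("new_id",
  row_number().over(Window.orderBy("id", "sub_id")))

// Distributed alternative: ids are unique and increasing, but not consecutive:
val withUniqueId = resultDf.withColumn("new_id", monotonically_increasing_id())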

Some good old-fashioned coding, along with a few learnable items for Scala.

Invariably the Any aspect reared its ugly head, so I used an Option approach. Note the toString() calls required and the particular way of building a List dynamically.

Is it functional programming? Less so than the other solution, but I was not sure how to do the fold, etc. with nested structures. Consider this an intermediate solution, then, which may serve you well with mapPartitions (a sketch of that follows after the output below).

Here is the code:

import spark.implicits._
import org.apache.spark.sql.functions._
import scala.util.Try

def tryToInt( s: String ) = Try(s.toInt).toOption 
def lp( a: List[Int] ) :List[List[Option[Int]]] = {
      var cnt:Int = 0
      var tempAA: List[List[Int]] = List()
      var tempA: List[Int] = List()
      var c:Int = 0
      val sInput = a.size
      // Does not work for empty List, but will not be empty
      for (v <- a) {
        if (cnt > 0 && v != c) {
          tempAA = tempAA :+ tempA
          tempA = List()
        }
        c = v
        cnt += 1
        tempA = tempA :+ v
      }
      tempAA = tempAA :+ tempA

      val numItems = tempAA.map(x => x.size)          // List of occurrences per slot 
      val cumCount = numItems.scanLeft(0)(_ + _).tail // Cumulative count
      var res: List[List[Option[Int]]] = List()
      var tempAAA: List[List[String]] = List()
   
      for (i <- 0 until numItems.length) {
            val itemsLeft = cumCount(i) - numItems(i)
            val itemsRight = sInput - cumCount(i)
            val left = List.fill(itemsLeft)(None)
            val right = List.fill(itemsRight)(None)
        
            tempAAA = List()
            tempAAA = tempAAA :+ left.map(_.toString())
            tempAAA = tempAAA :+ tempAA(i).map(_.toString())
            tempAAA = tempAAA :+ right.map(_.toString())
            val tempAAAA = tempAAA.flatten.map(_.toString()).map(x => tryToInt(x))
            res = res :+ tempAAAA
      }
      return res
}

val dataIn = Seq((1,2,2,3,5,5,5,5,5), (4,2,2,2,5,5,5,5,5), (5,5,5,5,5,5,5,5,5)).toDS()
val data = dataIn.withColumn("input", array(dataIn.columns.map(col): _*)).select($"input").as[List[Int]]
val df = data.rdd.map(lp).toDF().select(explode($"value"))
val n = dataIn.columns.size
df.select( (0 until n).map(i => col("col")(i).alias(s"c${i+1}")): _*).show(false)

returns:

+----+----+----+----+----+----+----+----+----+
|c1  |c2  |c3  |c4  |c5  |c6  |c7  |c8  |c9  |
+----+----+----+----+----+----+----+----+----+
|1   |null|null|null|null|null|null|null|null|
|null|2   |2   |null|null|null|null|null|null|
|null|null|null|3   |null|null|null|null|null|
|null|null|null|null|5   |5   |5   |5   |5   |
|4   |null|null|null|null|null|null|null|null|
|null|2   |2   |2   |null|null|null|null|null|
|null|null|null|null|5   |5   |5   |5   |5   |
|5   |5   |5   |5   |5   |5   |5   |5   |5   |
+----+----+----+----+----+----+----+----+----+
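
As hinted above, the same lp function can be plugged into mapPartitions instead of map; the following sketch reuses the data and dataIn definitions from the code above, and whether it helps in practice is an assumption to verify, not a claim from the original answer:

// One call per partition; flatMap flattens each row's list of sub-rows,
// so the explode step is no longer needed.
val exploded = data.rdd
  .mapPartitions(_.flatMap(lp))
  .toDF("value")

val n = dataIn.columns.size
exploded
  .select((0 until n).map(i => col("value")(i).alias(s"c${i + 1}")): _*)
  .show(false)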
