Spark Dataframe access of previous calculated row

Question

I have following Data:

+-----+-----+----+
|Col1 |t0   |t1  |
+-----+-----+----+
| A   |null |20  |
| A   |20   |40  |
| B   |null |10  |
| B   |10   |20  |
| B   |20   |120 |
| B   |120  |140 |
| B   |140  |320 |
| B   |320  |340 |
| B   |340  |360 |
+-----+-----+----+

And what I want is something like this:

+-----+-----+----+----+
|Col1 |t0   |t1  |grp |
+-----+-----+----+----+
| A   |null |20  |1A  |
| A   |20   |40  |1A  |
| B   |null |10  |1B  |
| B   |10   |20  |1B  |
| B   |20   |120 |2B  |
| B   |120  |140 |2B  |
| B   |140  |320 |3B  |
| B   |320  |340 |3B  |
| B   |340  |360 |3B  |
+-----+-----+----+----+

Explanation: The extra column is based on the Col1 and the difference between t1 and t0. When the difference between that two is too high => a new number is generated. (in the dataset above when the difference is greater than 50)

I build t0 with:

val windowSpec = Window.partitionBy($"Col1").orderBy("t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)

Can someone help me how to do it? I searched but didn't get a good idea. I'm a little bit lost because I need the value of the previous calculated row of grp...

Thanks

Answer 1

I solved it myself

val grp =  (coalesce(
      ($"t" - lag($"t", 1).over(windowSpec)),
      lit(0)
    ) > 50).cast("bigint")

df = df.withColumn("grp", sum(grp).over(windowSpec))

With this I don't need both colums (t0 and t1) anymore but can use only t1 (or t) without compute t0.

(I only need to add the value of Col1 but the most important part the number is done and works fine.)

I got the solution from: Spark SQL window function with complex condition

thanks for your help

Answer 2

You can use udf function to generate the grp column

def testUdf = udf((col1: String, t0: Int, t1: Int)=> (t1-t0) match {
  case x : Int if(x > 50) => 2+col1
  case _ => 1+col1
})

Call the udf function as

df.withColumn("grp", testUdf($"Col1", $"t0", $"t1"))

The udf function above won't work properly due to null values in t0 which can be replaced by 0

df.na.fill(0)

I hope this is the answer you are searching for.

Edited
Here's the complete solution using udaf . The process is complex . You've already got easy answer but it might help somebody who might use it
First defining udaf

class Boendal extends UserDefinedAggregateFunction {

  def inputSchema = new StructType().add("Col1", StringType).add("t0", IntegerType).add("t1", IntegerType).add("rank", IntegerType)
  def bufferSchema = new StructType().add("buff", StringType).add("buffer1", IntegerType)
  def dataType = StringType
  def deterministic = true

  def initialize(buffer: MutableAggregationBuffer) = {
    buffer.update(0, "")
    buffer.update(1, 0)
  }

  def update(buffer: MutableAggregationBuffer, input: Row) = {
    if (!input.isNullAt(0)) {
      val buff = buffer.getString(0)
      val col1 = input.getString(0)
      val t0 = input.getInt(1)
      val t1 = input.getInt(2)
      val rank = input.getInt(3)
      var value = 1
      if((t1-t0) < 50)
        value = 1
      else
        value = (t1-t0)/50

      val lastValue = buffer(1).asInstanceOf[Integer]
      // if(!buff.isEmpty) {
      if (value < lastValue)
        value = lastValue
      // }
      buffer.update(1, value)

      var finalString = ""
      if(buff.isEmpty){
        finalString = rank+";"+value+col1
      }
      else
        finalString = buff+"::"+rank+";"+value+col1

      buffer.update(0, finalString)
    }
  }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row) = {
    val buff1 = buffer1.getString(0)
    val buff2 = buffer2.getString(0)
    buffer1.update(0, buff1+buff2)
  }

  def evaluate(buffer: Row) : String = {
    buffer.getString(0)
  }
}

Then some udfs

def rankUdf = udf((grp: String)=> grp.split(";")(0))
def removeRankUdf = udf((grp: String) => grp.split(";")(1))

And finally call the udaf and udfs

val windowSpec = Window.partitionBy($"Col1").orderBy($"t1")
df = df.withColumn("t0", lag("t1", 1) over windowSpec)
  .withColumn("rank", rank() over windowSpec)
df = df.na.fill(0)
val boendal = new Boendal
val df2 = df.groupBy("Col1").agg(boendal($"Col1", $"t0", $"t1", $"rank").as("grp2")).withColumnRenamed("Col1", "Col2")
    .withColumn("grp2", explode(split($"grp2", "::")))
    .withColumn("rank2", rankUdf($"grp2"))
    .withColumn("grp2", removeRankUdf($"grp2"))

df = df.join(df2, df("Col1") === df2("Col2") && df("rank") === df2("rank2"))
  .drop("Col2", "rank", "rank2")
df.show(false)

Hope it helps

Spark Dataframe access of previous calculated row

Question

2 answers

solution1
2 ACCPTED 2017-05-18 06:33:40

solution2
1 2017-05-17 10:06:11

Spark Dataframe access of previous calculated row

Question

2 answers

solution1 2 ACCPTED 2017-05-18 06:33:40

solution2 1 2017-05-17 10:06:11

solution1
2 ACCPTED 2017-05-18 06:33:40

solution2
1 2017-05-17 10:06:11