
Sum of consecutive values in column of a Spark dataframe

Hi, I have a dataframe as below:

+-------+--------+
|id     |level   |
+-------+--------+
|    0  |   0    |
|    1  |   0    |
|    2  |   1    |
|    3  |   1    |
|    4  |   1    |
|    5  |   0    |
|    6  |   1    |
|    7  |   1    |
|    8  |   0    |
|    9  |   1    |
|   10  |   0    |
+-------+--------+

and I need the counts of consecutive 1's, so the output should be 3, 2, 1. However, the constraint in this scenario is that I cannot use a UDF. Is there any built-in Scala/Spark function that can do this trick?

You could use row_number and count (SQL/DataFrame API) to count the number of consecutive repeated values in a column. The trick is to compute the offset between the current row's global row number and its row number within its own level: that offset is constant within each run of consecutive equal values, so it can serve as a group key.

Scala
val df = spark.createDataFrame(Seq((0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0))).toDF("id","level")
df.createOrReplaceTempView("DT")
val df_cnt = spark.sql("select level, count(*) from (select *, (row_number() over (order by id) - row_number() over (partition by level order by id)) as grp from DT order by id) as t where level != 0 group by grp, level")
df_cnt.show()

The ordering by id must be preserved; otherwise the query will produce wrong results.
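To see why the offset between the two row numbers identifies each run, here is a plain-Python sketch of the same computation (no Spark; `consecutive_counts` is a hypothetical helper written for illustration). The global row number grows by 1 on every row, while the per-level row number grows by 1 only on rows of that level, so their difference stays constant within a run of consecutive equal values and jumps when the run is interrupted.

```python
from collections import defaultdict

def consecutive_counts(levels, target=1):
    """Count the lengths of runs of `target` using the row-number-offset trick."""
    per_level_rn = defaultdict(int)   # running row_number() per level
    run_sizes = defaultdict(int)      # (level, grp) -> count
    order = []                        # group keys in first-seen order
    for global_rn, level in enumerate(levels, start=1):
        per_level_rn[level] += 1
        # constant within a run of consecutive equal values:
        grp = global_rn - per_level_rn[level]
        key = (level, grp)
        if level == target:
            if key not in run_sizes:
                order.append(key)
            run_sizes[key] += 1
    return [run_sizes[k] for k in order]

levels = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
print(consecutive_counts(levels))  # -> [3, 2, 1]
```

This mirrors exactly what the inner query computes with `row_number() over (order by id) - row_number() over (partition by level order by id)` before the outer `group by grp, level`.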

Pyspark
df = spark.createDataFrame([(0,0),(1,0),(2,1),(3,1),(4,1),(5,0),(6,1),(7,1),(8,0),(9,1),(10,0)]).toDF("id","level")
df.createOrReplaceTempView('DF')
# same query as before, via spark.sql(...), reading from the DF view
SQL
select level, count(*) from
  (select *,
    (row_number() over (order by id) -
     row_number() over (partition by level order by id)
    ) as grp
   from DF order by id) as t
where level != 0
group by grp, level
  • Intermediate SQL computation detail (row offset and grouping): on the sample data, the inner query assigns grp = 2 to ids 2-4, grp = 3 to ids 6-7, and grp = 4 to id 9, so the outer group by yields the counts 3, 2, 1.

[Screenshot: the SQL query executed in spark-shell]

You could do something like this:

val seq = Seq(0,0,1,1,1,0,1,1,0,1,0)

val seq1s = seq.foldLeft("")(_ + _).split("0")
seq1s.map(_.sliding(1).count(_ == "1"))

res: Array[Int] = Array(0, 0, 3, 2, 1)

If you don't want the 0s there, you could filter them out instead:

seq1s.map(_.sliding(1).count(_ == "1")).filterNot(_ == 0)

res: Array[Int] = Array(3, 2, 1)
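For comparison, the same fold-and-split idea can be written in Python with `itertools.groupby`, which partitions the sequence into runs of equal values directly (a sketch, not part of either answer above):

```python
from itertools import groupby

seq = [0, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0]

# groupby yields one (key, run) pair per maximal run of equal values;
# keep only the lengths of the runs of 1s.
runs = [len(list(g)) for k, g in groupby(seq) if k == 1]
print(runs)  # -> [3, 2, 1]
```

Like the filtered Scala version, this never produces the zero-length entries, so no extra `filterNot` step is needed.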
