How to get the value of the previous row in a Scala Apache Spark RDD[Row]?

I need to get the value from the previous or next row while I'm iterating through an RDD[Row].

(10,1,string1)
(11,1,string2)
(21,1,string3)
(22,1,string4)

I need to concatenate the strings of rows where the difference between the first values is not higher than 3. The second value is an ID. So the result should be:

(1, string1string2)
(1, string3string4)

I tried groupBy, reduce, and partitioning, but I still can't achieve what I want.

I'm trying to do something like this (I know it's not the proper way):

rows.groupBy(row => {
      row(1)
    }).map(rowList => {
      rowList.reduce((acc, next) => {
        val diff = next(0) - acc(0)
        if(diff <= 3){
          val strings = acc(2) + next(2)
          (acc(1), strings)
        }else{
          // create a new group to aggregate strings
          (acc(1), acc(2))
        }
      })
    })

I wonder whether my idea is a proper way to solve this problem. Any help is appreciated!
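One difficulty with the attempt above is that reduce has to return a single value of the same type as its inputs, so it has nowhere to put a second group. A fold that carries a list of finished groups avoids this. The following is only a minimal sketch on a plain Scala collection, not a tested Spark job: the names GapMerge, Rec and mergeByGap are hypothetical, and it assumes the gap is measured against the previous row rather than against the first row of the group.

```scala
object GapMerge {
  // Hypothetical row shape: (value, id, text), mirroring the sample data.
  type Rec = (Int, Int, String)

  // Merge consecutive rows (already sorted by value) whose gap to the
  // previous row is at most maxGap. Assumption: the gap is measured
  // against the previous row, not the first row of the group.
  def mergeByGap(sorted: Seq[Rec], maxGap: Int = 3): Vector[(Int, String)] = {
    // Accumulator: groups built so far, plus the value of the last row seen.
    val start = (Vector.empty[(Int, String)], Option.empty[Int])
    val (groups, _) = sorted.foldLeft(start) {
      case ((acc, Some(prev)), (a, _, s)) if a - prev <= maxGap =>
        // Close enough: append the string to the group being built.
        val (id, text) = acc.last
        (acc.init :+ ((id, text + s)), Some(a))
      case ((acc, _), (a, id, s)) =>
        // First row, or gap too large: start a new group.
        (acc :+ ((id, s)), Some(a))
    }
    groups
  }

  val rows: Seq[Rec] = Seq(
    (10, 1, "string1"), (11, 1, "string2"),
    (21, 1, "string3"), (22, 1, "string4"))

  // Group by ID first, then merge within each ID in value order.
  def mergeAll(rs: Seq[Rec]): Seq[(Int, String)] =
    rs.groupBy(_._2).toSeq.sortBy(_._1).flatMap {
      case (_, sameId) => mergeByGap(sameId.sortBy(_._1))
    }

  def main(args: Array[String]): Unit =
    mergeAll(rows).foreach(println)
    // prints (1,string1string2) and then (1,string3string4)
}
```

On an RDD of such tuples the same function could probably be applied per key, for example rdd.groupBy(_._2).flatMap { case (_, rs) => GapMerge.mergeByGap(rs.toSeq.sortBy(_._1)) }, assuming all rows of one ID fit in memory on a single executor.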

I think you can solve your problem with sqlContext by using the lag function.

Create RDD:

val rdd = sc.parallelize(List(
(10, 1, "string1"),
(11, 1, "string2"),
(21, 1, "string3"),
(22, 1, "string4"))
)

Create DataFrame:

import sqlContext.implicits._ // needed for toDF
val df = rdd.toDF("a", "b", "c") // c stays a String; calling toInt on "string1" would throw

Register your DataFrame as a temporary table:

df.registerTempTable("df")

Query the result:

val res = sqlContext.sql("""
SELECT CASE WHEN l <= 3 THEN ROW_NUMBER() OVER (ORDER BY a) - 1
            ELSE ROW_NUMBER() OVER (ORDER BY a)
       END m, b, c
FROM (
  SELECT a, b,
         (a - CASE WHEN lag(a, 1) OVER (ORDER BY a) IS NOT NULL
                   THEN lag(a, 1) OVER (ORDER BY a)
                   ELSE 0
              END) l, c
  FROM df) A
""")

Show the results:

res.show
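The query compares each row with its predecessor via lag. As far as I can tell, subtracting 1 from the row number gives both rows of a pair the same label, but a run of three or more close rows would end up with different labels. A common variant is to flag each row that starts a new group and take a running sum of the flags. The sketch below traces that idea on a plain Scala collection, without Spark; the names LagGroups and groupIds are hypothetical, and in SQL the running sum would correspond to something like SUM(flag) OVER (ORDER BY a).

```scala
object LagGroups {
  // Assign a group index to each value of column a, already sorted ascending.
  // A row starts a new group when its gap to the previous row exceeds maxGap.
  def groupIds(sortedA: Seq[Int], maxGap: Int = 3): Seq[Int] = {
    // Counterpart of lag(a, 1): the previous value, or None for the first row.
    val lagged: Seq[Option[Int]] = None +: sortedA.dropRight(1).map(Option(_))
    // Flag is 0 when the row continues the current group, 1 when it starts one.
    val flags = sortedA.zip(lagged).map {
      case (a, Some(prev)) if a - prev <= maxGap => 0
      case _ => 1
    }
    // Running sum of the flags; drop the initial seed value 0.
    flags.scanLeft(0)(_ + _).tail
  }

  def main(args: Array[String]): Unit = {
    println(groupIds(Seq(10, 11, 21, 22))) // the sample data: two pairs
    println(groupIds(Seq(10, 11, 12, 21))) // a run of three, then a single row
  }
}
```

Rows that share a group index (and the same ID) can then be grouped and their strings concatenated in order of a.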

I hope this helps.
