
Spark - How to concatenate rows from window with a stride value

I have a DataFrame as follows (time series data):

    value category
    a1    c1
    a2    c1
    a3    c1
    a4    c1
    a5    c1
    a6    c1
    a7    c1
    a8    c1
    a1    c2
    a2    c2
    a3    c2
    a4    c2
    a5    c2
    a6    c2
    a7    c2
    a8    c2
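
For reference, this sample data can be built with a minimal snippet like the following (assuming an existing SparkSession named spark):

# build the sample DataFrame shown above (assumes a SparkSession named `spark`)
data = [(f"a{i}", c) for c in ("c1", "c2") for i in range(1, 9)]
df = spark.createDataFrame(data, ["value", "category"])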

What I want to do is apply a sliding window with a window size of 4 and a stride of 2, i.e. each window contains 4 rows and the window moves forward by 2 rows at a time. The expected result should look like this:

    window value      category
    [a1, a2, a3, a4]    c1
    [a3, a4, a5, a6]    c1
    [a5, a6, a7, a8]    c1
    [a1, a2, a3, a4]    c2
    [a3, a4, a5, a6]    c2
    [a5, a6, a7, a8]    c2

I have tried the window function. However, to the best of my knowledge, the window is evaluated for every single row of my DataFrame, so there is no way to specify a stride. The sample source code is:

from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list

# define the window spec covering the current row and the 3 following rows
windowSpec = Window.partitionBy(col("category")).orderBy(col("value")).rowsBetween(0, 3)
# get the window data
window_data = df.withColumn("window_data", collect_list(col("value")).over(windowSpec))

So the result would be like:

    window_data      category
    [a1, a2, a3, a4]    c1
    [a2, a3, a4, a5]    c1
    [a3, a4, a5, a6]    c1
    [a4, a5, a6, a7]    c1
    ...

Update: Actually, we could collect the window contents for every row of the DataFrame and then filter to keep only the rows at specific positions. But this seems expensive, since it requires two passes over all the rows of the DataFrame and we build window arrays that are simply thrown away later. Intuitively, I think there should be a more optimized option.

Could you guys recommend any methods to get the result I want?

Thanks in advance :-)

Truong,

Here's some Scala code (the Python version is very similar):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val window = Window.partitionBy(col("category")).orderBy("value")
val window_data = df.withColumn("window_data", collect_list(col("value")).over(window.rowsBetween(0, 3)))
    .withColumn("rownum", row_number().over(window))  // position of each row within its category
    .where(pmod(col("rownum"), lit(2)) === 1 && size(col("window_data")) === 4)  // keep odd positions with a full 4-element window
    .drop("rownum")

window_data.show(false)

Output:

+-----+--------+----------------+
|value|category|window_data     |
+-----+--------+----------------+
|a1   |c1      |[a1, a2, a3, a4]|
|a3   |c1      |[a3, a4, a5, a6]|
|a5   |c1      |[a5, a6, a7, a8]|
|a1   |c2      |[a1, a2, a3, a4]|
|a3   |c2      |[a3, a4, a5, a6]|
|a5   |c2      |[a5, a6, a7, a8]|
+-----+--------+----------------+

The idea is to compute each row's position using the same window spec as the one used for the window_data field and keep only the odd-numbered rows. In addition, we drop the rows for which window_data does not contain 4 elements (instead of padding with nulls).
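
Since the question uses PySpark, here is a rough Python translation of the same idea (an untested sketch; the column and variable names simply mirror the Scala version above):

from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list, row_number, size

window = Window.partitionBy(col("category")).orderBy("value")

window_data = (
    df
    # collect the current row and the 3 following rows into an array
    .withColumn("window_data", collect_list(col("value")).over(window.rowsBetween(0, 3)))
    # position of each row within its category
    .withColumn("rownum", row_number().over(window))
    # keep only odd positions (stride of 2) that have a full 4-element window
    .where(((col("rownum") % 2) == 1) & (size(col("window_data")) == 4))
    .drop("rownum")
)

window_data.show(truncate=False)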

Edit: Here's the query plan. It shows that a single Window operator is used to compute both the rownum and window_data columns (no extra shuffle / sort).

== Physical Plan ==
*(2) Project [value#5, category#6, window_data#10]
+- *(2) Filter ((isnotnull(rownum#15) && (pmod(rownum#15, 2) = 1)) && (size(window_data#10) = 4))
   +- Window [collect_list(value#5, 0, 0) windowspecdefinition(category#6, value#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, currentrow$(), 3)) AS window_data#10, row_number() windowspecdefinition(category#6, value#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rownum#15], [category#6], [value#5 ASC NULLS FIRST]
      +- *(1) Sort [category#6 ASC NULLS FIRST, value#5 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(category#6, 200)
            +- LocalTableScan [value#5, category#6]
