I have a DataFrame as follows (time series data):
value category
a1 c1
a2 c1
a3 c1
a4 c1
a5 c1
a6 c1
a7 c1
a8 c1
a1 c2
a2 c2
a3 c2
a4 c2
a5 c2
a6 c2
a7 c2
a8 c2
What I want is a sliding window with window size = 4 and stride = 2: each window contains 4 rows, and the window advances 2 rows at a time. The expected result looks like this:
window value category
[a1, a2, a3, a4] c1
[a3, a4, a5, a6] c1
[a5, a6, a7, a8] c1
[a1, a2, a3, a4] c2
[a3, a4, a5, a6] c2
[a5, a6, a7, a8] c2
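To pin down the semantics of "window size 4, stride 2", here is a minimal plain-Python sketch (no Spark involved). The helper names `sliding_windows` and `windows_by_category` are made up for illustration, and the sketch assumes the rows of each category are already in time order.

```python
def sliding_windows(values, size=4, stride=2):
    """Return every full window of `size` items, advancing `stride` items each step."""
    return [values[i:i + size] for i in range(0, len(values) - size + 1, stride)]


def windows_by_category(rows, size=4, stride=2):
    """Group (value, category) rows by category, then window each group separately."""
    grouped = {}
    for value, category in rows:
        grouped.setdefault(category, []).append(value)
    return [
        (window, category)
        for category, values in grouped.items()
        for window in sliding_windows(values, size, stride)
    ]


# Same shape as the sample data: a1..a8 for c1, then a1..a8 for c2.
rows = [(f"a{i}", c) for c in ("c1", "c2") for i in range(1, 9)]
for window, category in windows_by_category(rows):
    print(window, category)
```

Windows that would run past the end of a category are simply not produced, which matches the expected result above.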
I have tried using a window function. However, as far as I can tell, the window is evaluated for every row of my DataFrame. The sample source code is:
from pyspark.sql import Window
from pyspark.sql.functions import col, collect_list

# define the window spec: the current row plus the 3 rows following it
windowSpec = Window.partitionBy(col("category")).orderBy(col("value")).rowsBetween(0, 3)
# get the window data
window_data = df.withColumn('window_data', collect_list(col("value")).over(windowSpec))
So the result would be like:
window_data category
[a1, a2, a3, a4] c1
[a2, a3, a4, a5] c1
[a3, a4, a5, a6] c1
[a4, a5, a6, a7] c1
...
Update: We could collect the window for every row of the DataFrame and then keep only the rows at the positions we need. But this seems expensive: it takes two passes over all the rows, and it builds windows for rows that are discarded afterwards. Intuitively, I think there should be a more efficient option.
Could you guys recommend any methods to get the result I want?
Thanks in advance :-)
Truong,
Here's some Scala code (very similar to Python):
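import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._ // for the $"..." syntax; assumes a SparkSession named spark

val window = Window.partitionBy(col("category")).orderBy("value")
val window_data = df.withColumn("window_data", collect_list(col("value")).over(window.rowsBetween(0, 3)))
  .withColumn("rownum", row_number().over(window))
  .where(pmod($"rownum", lit(2)) === 1 && size(col("window_data")) === 4)
  .drop("rownum")
window_data.show(false)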
Output:
+-----+--------+----------------+
|value|category|window_data     |
+-----+--------+----------------+
|a1   |c1      |[a1, a2, a3, a4]|
|a3   |c1      |[a3, a4, a5, a6]|
|a5   |c1      |[a5, a6, a7, a8]|
|a1   |c2      |[a1, a2, a3, a4]|
|a3   |c2      |[a3, a4, a5, a6]|
|a5   |c2      |[a5, a6, a7, a8]|
+-----+--------+----------------+
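The idea is to compute each row's position (rownum) over the same partitioning and ordering as the window used for the window_data field, and to keep only the odd-numbered rows (1, 3, 5, ...), which gives a stride of 2. In addition, we drop rows whose window_data holds fewer than 4 elements, i.e. the incomplete windows at the end of each category (instead of padding them with nulls).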
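The filtering logic can be checked outside Spark with a small plain-Python sketch. This is only an illustration of the idea, not the Spark implementation: the function name `windows_via_filter` is hypothetical, and the condition `(rownum - 1) % stride == 0` is an assumed generalization of the answer's `pmod(rownum, 2) === 1` to other strides.

```python
def windows_via_filter(values, size=4, stride=2):
    """Build a look-ahead frame for every row, then keep only the rows
    that start a window, mimicking the collect_list + row_number filter."""
    kept = []
    for rownum, _ in enumerate(values, start=1):  # row_number() is 1-based
        # Equivalent of rowsBetween(0, size - 1): current row plus the following ones.
        frame = values[rownum - 1:rownum - 1 + size]
        starts_window = (rownum - 1) % stride == 0  # rownum 1, 3, 5, ... when stride is 2
        is_full = len(frame) == size                # drop truncated frames at the end
        if starts_window and is_full:
            kept.append(frame)
    return kept


values = [f"a{i}" for i in range(1, 9)]
for frame in windows_via_filter(values):
    print(frame)
```

For a1..a8 this keeps the frames starting at rows 1, 3 and 5; the frame starting at row 7 only has two elements and is dropped by the size check.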
Edit: Here's the query plan. It shows that a single Window operator computes both the rownum and window_data columns (no extra shuffle / sort).
== Physical Plan ==
*(2) Project [value#5, category#6, window_data#10]
+- *(2) Filter ((isnotnull(rownum#15) && (pmod(rownum#15, 2) = 1)) && (size(window_data#10) = 4))
   +- Window [collect_list(value#5, 0, 0) windowspecdefinition(category#6, value#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, currentrow$(), 3)) AS window_data#10, row_number() windowspecdefinition(category#6, value#5 ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rownum#15], [category#6], [value#5 ASC NULLS FIRST]
      +- *(1) Sort [category#6 ASC NULLS FIRST, value#5 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(category#6, 200)
            +- LocalTableScan [value#5, category#6]