How to pair rows in a Spark DataFrame based on timestamp range and row type
I have a dataframe similar to this:
+------------------+---------+------------+
| Timestamp | RowType | Value |
+------------------+---------+------------+
| 2020. 6. 5. 8:12 | X | Null |
| 2020. 6. 5. 8:13 | Y | Null |
| 2020. 6. 5. 8:14 | Y | Null |
| 2020. 6. 5. 8:15 | A | SomeValue |
| 2020. 6. 5. 8:16 | Y | Null |
| 2020. 6. 5. 8:17 | Y | Null |
| 2020. 6. 5. 8:18 | X | Null |
| 2020. 6. 5. 8:19 | Y | Null |
| 2020. 6. 5. 8:20 | Y | Null |
| 2020. 6. 6. 8:21 | A | SomeValue2 |
| 2020. 6. 7. 8:22 | Y | Null |
| 2020. 6. 8. 8:23 | Y | Null |
| 2020. 6. 9. 8:24 | X | Null |
+------------------+---------+------------+
For each X-typed row, I want to select the value from the next A-typed row. If there is no A-typed row between two X-typed rows, the value of the X row should remain null.
+------------------+---------+------------+
| Timestamp | RowType | Value |
+------------------+---------+------------+
| 2020. 6. 5. 8:12 | X | SomeValue |
| 2020. 6. 5. 8:18 | X | SomeValue2 |
| 2020. 6. 9. 8:24 | X | Null |
+------------------+---------+------------+
Is this possible using window functions?
If RowType contains only these values (X, Y, A), this should work:
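import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
import spark.implicits._ // assumes a SparkSession named `spark`; enables the 'column syntax
df.filter('RowType =!= "Y") // drop Y rows so that each X is directly followed by an A or another X
.select('Timestamp, 'RowType, lag('Value, -1).over(Window.orderBy('Timestamp)).as("lag")) // negative offset reads the next row
.filter('RowType === "X") // keep only the X rows, now carrying the following row's value
.show()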
Output:
+----------------+-------+-----------+
| Timestamp|RowType| lag|
+----------------+-------+-----------+
|2020. 6. 5. 8:12| X|SomeValue |
|2020. 6. 5. 8:18| X|SomeValue2 |
|2020. 6. 9. 8:24| X| null|
+----------------+-------+-----------+
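If RowType can contain values other than X, Y and A, a slightly more defensive variant might keep only the X and A rows explicitly and copy the value only when the next kept row really is an A. The following is a minimal sketch under some assumptions, not a tested production solution: the sample data, the extra row type "B", the ISO-style timestamp strings (chosen so that string ordering matches time ordering) and the helper column names NextType and NextValue are all made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lead, when}

// Local session for experimenting; in spark-shell, getOrCreate() reuses the existing session.
val spark = SparkSession.builder().master("local[*]").appName("pair-rows").getOrCreate()
import spark.implicits._

// Hypothetical sample data, including a row type "B" that is not in the question.
val df = Seq[(String, String, Option[String])](
  ("2020-06-05 08:12", "X", None),
  ("2020-06-05 08:13", "Y", None),
  ("2020-06-05 08:15", "A", Some("SomeValue")),
  ("2020-06-05 08:16", "B", Some("Noise")),
  ("2020-06-05 08:18", "X", None),
  ("2020-06-06 08:21", "A", Some("SomeValue2")),
  ("2020-06-09 08:24", "X", None)
).toDF("Timestamp", "RowType", "Value")

// No partitionBy: Spark moves all rows into a single partition for this window.
val w = Window.orderBy(col("Timestamp"))

val result = df
  .filter(col("RowType").isin("X", "A"))                    // ignore every other row type
  .withColumn("NextType", lead(col("RowType"), 1).over(w))  // type of the next kept row
  .withColumn("NextValue", lead(col("Value"), 1).over(w))   // value of the next kept row
  .filter(col("RowType") === "X")
  .select(
    col("Timestamp"),
    col("RowType"),
    // Copy the value only if the next kept row is an A; otherwise the result stays null.
    when(col("NextType") === "A", col("NextValue")).as("Value"))

result.show(false)
```

The when(...) guard only matters if X rows can carry a value of their own; with the data from the question it behaves the same as the plain lag/lead approach. On large data, a window without partitionBy can become a bottleneck, so partitioning by some key (if the data has one) is probably worth considering.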