Hive: How to find a value based on a previous row's value?
I have IoT-style data. I need to replace every "none" with the value from the most recent earlier row whose value is not "none".
Original data:
+---------------------+--------+
| time                | value  |
+---------------------+--------+
| 2020-01-01 11:11:10 | "0.3"  |
| 2020-01-01 11:11:11 | "0.2"  |
| 2020-01-01 11:11:12 | "none" |
| 2020-01-01 11:11:13 | "none" |
| 2020-01-01 11:11:14 | "none" |
| 2020-01-01 11:11:15 | "0.1"  |
| 2020-01-01 11:11:16 | "none" |
| 2020-01-01 11:11:17 | "0.4"  |
+---------------------+--------+
The final data should look like this:
+---------------------+--------+
| time                | value  |
+---------------------+--------+
| 2020-01-01 11:11:10 | "0.3"  |
| 2020-01-01 11:11:11 | "0.2"  |
| 2020-01-01 11:11:12 | "0.2"  |
| 2020-01-01 11:11:13 | "0.2"  |
| 2020-01-01 11:11:14 | "0.2"  |
| 2020-01-01 11:11:15 | "0.1"  |
| 2020-01-01 11:11:16 | "0.1"  |
| 2020-01-01 11:11:17 | "0.4"  |
+---------------------+--------+
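Let me assume that "none" is actually NULL. Then you want LAG(IGNORE NULLS), but Hive does not support it. You can still do this in two steps. First, identify groups by counting the number of "real" (non-NULL) values up to and including each row: the count only increases on a row that has a value, so each value and the NULL rows that follow it share one group number. Then use a window function to assign each group's single non-NULL value to all of its rows: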
select t.*,
       max(value) over (partition by grp) as filled_value
from (select t.*,
             count(value) over (order by time) as grp  -- running count of non-NULL values
      from t
     ) t;
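To see what the two steps produce, here is a minimal sketch that runs the same query shape against the sample data. It uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption made only so the example is runnable (window functions need SQLite 3.25 or later). The `filled_value` alias and the subquery alias `s` are illustrative names.

```python
import sqlite3

# Sample rows from the question, with "none" stored as NULL (None in Python).
rows = [
    ("2020-01-01 11:11:10", 0.3),
    ("2020-01-01 11:11:11", 0.2),
    ("2020-01-01 11:11:12", None),
    ("2020-01-01 11:11:13", None),
    ("2020-01-01 11:11:14", None),
    ("2020-01-01 11:11:15", 0.1),
    ("2020-01-01 11:11:16", None),
    ("2020-01-01 11:11:17", 0.4),
]

# SQLite is only a local stand-in for Hive here (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (time TEXT, value REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

query = """
SELECT time, value, grp,
       MAX(value) OVER (PARTITION BY grp) AS filled_value
FROM (SELECT t.*,
             -- COUNT skips NULLs, so grp only increases on rows with a value
             COUNT(value) OVER (ORDER BY time) AS grp
      FROM t) AS s
ORDER BY time
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

In this run the `grp` column comes out as 1, 2, 2, 2, 2, 3, 3, 4, so each NULL row lands in the same group as the last non-NULL row before it, and `MAX` picks up that row's value.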
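Edit:
If you are actually storing the values as strings and 'none' is a real stored value, then just use a variant of the above, with nullif() turning 'none' into NULL before counting and taking the maximum: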
select t.*,
       max(nullif(value, 'none')) over (partition by grp) as filled_value
from (select t.*,
             count(nullif(value, 'none')) over (order by time) as grp
      from t
     ) t;
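The following sketch illustrates why the NULLIF wrapper matters when 'none' is stored as a literal string. As before, sqlite3 stands in for Hive purely so the example can be run locally (an assumption), and the helper name `run` and the alias `filled_value` are illustrative.

```python
import sqlite3

# Sample rows from the question, with the placeholder kept as the string 'none'.
rows = [
    ("2020-01-01 11:11:10", "0.3"),
    ("2020-01-01 11:11:11", "0.2"),
    ("2020-01-01 11:11:12", "none"),
    ("2020-01-01 11:11:13", "none"),
    ("2020-01-01 11:11:14", "none"),
    ("2020-01-01 11:11:15", "0.1"),
    ("2020-01-01 11:11:16", "none"),
    ("2020-01-01 11:11:17", "0.4"),
]

conn = sqlite3.connect(":memory:")  # local stand-in for Hive (assumption)
conn.execute("CREATE TABLE t (time TEXT, value TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)", rows)

def run(expr):
    """Run the two-step query, using `expr` as the counted and aggregated expression."""
    query = f"""
    SELECT MAX({expr}) OVER (PARTITION BY grp) AS filled_value
    FROM (SELECT t.*,
                 COUNT({expr}) OVER (ORDER BY time) AS grp
          FROM t) AS s
    ORDER BY time
    """
    return [row[0] for row in conn.execute(query)]

# Without NULLIF, 'none' counts as a value: every row is its own group.
naive = run("value")
# With NULLIF, 'none' becomes NULL and is filled from the previous real value.
filled = run("NULLIF(value, 'none')")

print(naive)
print(filled)
```

Without NULLIF the output is identical to the input, because the 'none' strings are counted like any other value. With NULLIF the 'none' rows take the preceding real value.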
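Your question is similar to the question about using COALESCE in Hive to replace NULL values with values from the same column.
There is one subtle difference: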
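with rank_table as (
    select *,
           -- running sum of value; NULL rows add nothing, so they keep the previous row's rnk
           -- (assumes missing values are NULL and every real value is positive)
           sum(value) over (order by time rows between unbounded preceding and current row) as rnk
    from your_table
)
select *, max(value) over (partition by rnk) as filled_value
from rank_table;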
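One thing worth checking before relying on a running SUM as the group key: it only separates groups if every real value changes the sum. The sketch below uses a small hypothetical data set (not the data from the question) that contains a reading of 0.0, and compares SUM with COUNT as the key. Again sqlite3 is used only as a runnable stand-in for Hive, and `forward_fill` is an illustrative helper name.

```python
import sqlite3

# Hypothetical readings, chosen so that one real value is 0.0.
rows = [
    ("2020-01-01 11:11:10", 0.3),
    ("2020-01-01 11:11:11", None),
    ("2020-01-01 11:11:12", 0.0),
    ("2020-01-01 11:11:13", None),
]

conn = sqlite3.connect(":memory:")  # local stand-in for Hive (assumption)
conn.execute("CREATE TABLE your_table (time TEXT, value REAL)")
conn.executemany("INSERT INTO your_table VALUES (?, ?)", rows)

def forward_fill(key_function):
    """Fill NULLs using a running SUM or COUNT of value as the group key."""
    query = f"""
    WITH rank_table AS (
        SELECT *,
               {key_function}(value) OVER (
                   ORDER BY time
                   ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS rnk
        FROM your_table
    )
    SELECT MAX(value) OVER (PARTITION BY rnk) AS filled_value
    FROM rank_table
    ORDER BY time
    """
    return [row[0] for row in conn.execute(query)]

sum_filled = forward_fill("SUM")      # the 0.0 row does not change the running sum
count_filled = forward_fill("COUNT")  # every non-NULL row starts a new group

print(sum_filled)
print(count_filled)
```

With SUM as the key, the 0.0 reading falls into the same group as 0.3 and is overwritten by it. With COUNT, the last two rows are filled with 0.0 as intended. If the values are known to be strictly positive, both keys give the same groups.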