Find max value from different columns in a single row in Scala DataFrame
I am trying to find the max value across different columns in a single row of a Scala DataFrame.
The data in the DataFrame is as below.
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
| NUM| SIG1| SIG2| SIG3|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531001,"VALUE":4.7825}]|[{"TIME":1569560531002,"VALUE":2.7825}]|
|XXXXX01|[{"TIME":1569560541001,"VALUE":1.7825}]|[{"TIME":1569560541000,"VALUE":8.7825}]|[{"TIME":1569560541003,"VALUE":5.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531009,"VALUE":3.7825}]| null |
|XXXXX02|[{"TIME":1569560531000,"VALUE":5.7825}]|[{"TIME":1569560531007,"VALUE":8.7825}]|[{"TIME":1569560531006,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531009,"VALUE":1.7825}]|[{"TIME":1569560531010,"VALUE":3.7825}]|
and the schema is:
scala> DF.printSchema
root
|-- NUM: string (nullable = true)
|-- SIG1: string (nullable = true)
|-- SIG2: string (nullable = true)
|-- SIG3: string (nullable = true)
The expected output is as below.
+-------+--------------+----------+------------+------------+
| NUM| TIME | SIG1 | SIG2 | SIG3 |
+-------+--------------+----------+------------+------------+
|XXXXX01| 1569560531002| 3.7825 | 4.7825 | 2.7825 |
|XXXXX01| 1569560541003| 1.7825 | 8.7825 | 5.7825 |
|XXXXX01| 1569560531009| 3.7825 | 3.7825 | null |
|XXXXX02| 1569560531007| 5.7825 | 8.7825 | 3.7825 |
|XXXXX02| 1569560531010| 9.7825 | 1.7825 | 3.7825 |
I need to add a new TIME column holding the highest TIME found in a single row, and keep only the VALUE in each SIG column.
Basically, the TIME in each column is replaced by the highest TIME value available in that row, and TIME and VALUE are split out into separate columns.
Is there any UDF or built-in function to achieve this? Thanks in advance.
Use the get_json_object function to extract values from JSON stored as a string.
Then it's quite straightforward:
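import org.apache.spark.sql.functions.{get_json_object, greatest}
import spark.implicits._ // enables the 'SIG1 column syntax (already in scope in spark-shell)

// get_json_object returns strings, so cast TIME to long to compare numerically;
// greatest skips nulls, so a null SIG column does not null out TIME
DF.withColumn("TIME", greatest(get_json_object('SIG1, "$[0].TIME").cast("long"),
                               get_json_object('SIG2, "$[0].TIME").cast("long"),
                               get_json_object('SIG3, "$[0].TIME").cast("long")))
  .withColumn("SIG1", get_json_object('SIG1, "$[0].VALUE"))
  .withColumn("SIG2", get_json_object('SIG2, "$[0].VALUE"))
  .withColumn("SIG3", get_json_object('SIG3, "$[0].VALUE"))
  .show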
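If typed columns are preferred over the strings that get_json_object returns, one possible variant is to parse each SIG column with from_json and an explicit schema. The following is only a sketch, not part of the original answer: the names sigSchema, sigCols, parsed and result are made up for illustration, the local SparkSession is an assumption (in spark-shell, getOrCreate simply returns the existing session), and the two sample rows are copied from the question. It also assumes every non-null SIG value is a JSON array with at least one element.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, greatest}
import org.apache.spark.sql.types._

// Assumption: a local session for illustration only
val spark = SparkSession.builder().master("local[*]").appName("max-time-per-row").getOrCreate()
import spark.implicits._

// Two rows copied from the question; SIG3 is null in the second one
val df = Seq[(String, String, String, String)](
  ("XXXXX01",
    """[{"TIME":1569560531000,"VALUE":3.7825}]""",
    """[{"TIME":1569560531001,"VALUE":4.7825}]""",
    """[{"TIME":1569560531002,"VALUE":2.7825}]"""),
  ("XXXXX01",
    """[{"TIME":1569560531000,"VALUE":3.7825}]""",
    """[{"TIME":1569560531009,"VALUE":3.7825}]""",
    null)
).toDF("NUM", "SIG1", "SIG2", "SIG3")

// Each SIG column holds a JSON array of {TIME, VALUE} objects
val sigSchema = ArrayType(StructType(Seq(
  StructField("TIME", LongType),
  StructField("VALUE", DoubleType))))

val sigCols = Seq("SIG1", "SIG2", "SIG3")

// Parse each SIG column once and keep the first array element as a struct
val parsed = sigCols.foldLeft(df) { (acc, c) =>
  acc.withColumn(c, from_json(col(c), sigSchema).getItem(0))
}

// TIME is the row-wise maximum (greatest skips nulls); each SIG column keeps only VALUE
val result = parsed
  .withColumn("TIME", greatest(sigCols.map(c => col(c).getField("TIME")): _*))
  .select(col("NUM") +: col("TIME") +: sigCols.map(c => col(c).getField("VALUE").as(c)): _*)

result.show(false)
```

Compared with the get_json_object version, TIME comes out as a long and the SIG columns as doubles, and adding a fourth signal column only requires extending sigCols.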