Find the maximum value across different columns in a single row of a Scala DataFrame

Question

I am trying to find the maximum value across different columns within a single row of a Scala DataFrame.

The data in the DataFrame is as follows.

+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|    NUM|                                   SIG1|                                   SIG2|                                   SIG3|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531001,"VALUE":4.7825}]|[{"TIME":1569560531002,"VALUE":2.7825}]|
|XXXXX01|[{"TIME":1569560541001,"VALUE":1.7825}]|[{"TIME":1569560541000,"VALUE":8.7825}]|[{"TIME":1569560541003,"VALUE":5.7825}]|
|XXXXX01|[{"TIME":1569560531000,"VALUE":3.7825}]|[{"TIME":1569560531009,"VALUE":3.7825}]|                                   null|
|XXXXX02|[{"TIME":1569560531000,"VALUE":5.7825}]|[{"TIME":1569560531007,"VALUE":8.7825}]|[{"TIME":1569560531006,"VALUE":3.7825}]|
|XXXXX02|[{"TIME":1569560531000,"VALUE":9.7825}]|[{"TIME":1569560531009,"VALUE":1.7825}]|[{"TIME":1569560531010,"VALUE":3.7825}]|
+-------+---------------------------------------+---------------------------------------+---------------------------------------+

The schema is:

scala> DF.printSchema
root
 |-- NUM: string (nullable = true)
 |-- SIG1: string (nullable = true)
 |-- SIG2: string (nullable = true)
 |-- SIG3: string (nullable = true)

The expected output is as follows.


+-------+-------------+------+------+------+
|    NUM|         TIME|  SIG1|  SIG2|  SIG3|
+-------+-------------+------+------+------+
|XXXXX01|1569560531002|3.7825|4.7825|2.7825|
|XXXXX01|1569560541003|1.7825|8.7825|5.7825|
|XXXXX01|1569560531009|3.7825|3.7825|  null|
|XXXXX02|1569560531007|5.7825|8.7825|3.7825|
|XXXXX02|1569560531010|9.7825|1.7825|3.7825|
+-------+-------------+------+------+------+

I need to add a new TIME column holding the highest TIME found in each row, and reduce each SIG column to its VALUE only.

Basically, the TIME in each column is replaced by the highest TIME value available in that row, and TIME and VALUE are split out into separate columns.

Are there any UDFs or built-in functions to achieve this? Thanks in advance.

Answer 1

Use the get_json_object function to extract values from JSON stored as a string.

Then it's quite straightforward:

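import org.apache.spark.sql.functions.{get_json_object, greatest}
import spark.implicits._  // enables the 'SIG1 column syntax (already imported in spark-shell)

// get_json_object returns strings, so cast TIME to long for a numeric comparison.
// greatest skips nulls, so a row with a null SIG column still gets a TIME.
DF.withColumn("TIME", greatest(get_json_object('SIG1, "$[0].TIME").cast("long"),
                               get_json_object('SIG2, "$[0].TIME").cast("long"),
                               get_json_object('SIG3, "$[0].TIME").cast("long")))
  .withColumn("SIG1", get_json_object('SIG1, "$[0].VALUE"))
  .withColumn("SIG2", get_json_object('SIG2, "$[0].VALUE"))
  .withColumn("SIG3", get_json_object('SIG3, "$[0].VALUE"))
  .show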
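
As an alternative to extracting each field with get_json_object, the SIG strings could be parsed once with from_json and an explicit schema, which gives typed TIME (long) and VALUE (double) fields instead of strings. The following is only a sketch under assumptions: the object name MaxTimeSketch, the helper withMaxTime, and the schema sigSchema are made up for illustration, and it assumes every SIG string is a JSON array whose first element carries TIME and VALUE.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, greatest}
import org.apache.spark.sql.types.{ArrayType, DoubleType, LongType, StructField, StructType}

object MaxTimeSketch {
  // Assumed layout of each SIG string: a JSON array of {TIME, VALUE} objects.
  val sigSchema: ArrayType = ArrayType(StructType(Seq(
    StructField("TIME", LongType),
    StructField("VALUE", DoubleType))))

  val sigCols: Seq[String] = Seq("SIG1", "SIG2", "SIG3")

  def withMaxTime(df: DataFrame): DataFrame = {
    // Parse each SIG string and keep the first array element as a struct.
    val parsed = sigCols.foldLeft(df) { (acc, c) =>
      acc.withColumn(c, from_json(col(c), sigSchema).getItem(0))
    }
    // greatest skips nulls, so a null SIG column does not null out TIME.
    val maxTime = greatest(sigCols.map(c => col(s"$c.TIME")): _*).as("TIME")
    // Replace each SIG struct with its VALUE field only.
    val values = sigCols.map(c => col(s"$c.VALUE").as(c))
    parsed.select(Seq(col("NUM"), maxTime) ++ values: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]").appName("max-time-sketch").getOrCreate()
    import spark.implicits._

    // Two sample rows taken from the question; the second has a null SIG3.
    val rows: Seq[(String, String, String, String)] = Seq(
      ("XXXXX01",
        """[{"TIME":1569560531000,"VALUE":3.7825}]""",
        """[{"TIME":1569560531001,"VALUE":4.7825}]""",
        """[{"TIME":1569560531002,"VALUE":2.7825}]"""),
      ("XXXXX01",
        """[{"TIME":1569560531000,"VALUE":3.7825}]""",
        """[{"TIME":1569560531009,"VALUE":3.7825}]""",
        null))

    withMaxTime(rows.toDF("NUM", "SIG1", "SIG2", "SIG3")).show(false)
    spark.stop()
  }
}
```

Folding over the column names keeps the logic the same if more SIG columns are added later; only sigCols would need to change.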
