
Parse the String column to get the data in date format using Spark Scala

I have the below column (TriggeredDateTime) in my .avro file, which is of type String. I need to get the data in yyyy-MM-dd HH:mm:ss format (as shown in the expected output) using Spark Scala. Please could you let me know whether there is any way to achieve this by writing a UDF, rather than using my approach below. Any help would be much appreciated.

 "TriggeredDateTime": {"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}

  expected output
  +-------------------+
  |TriggeredDateTime  |
  +-------------------+
  |2019-05-16 04:56:19|
  +-------------------+

My Approach:

I'm trying to convert the .avro file to JSON format by applying the schema, and then I can try parsing the JSON to get the required results. For reference, loading the .avro file might look like the sketch below.
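A minimal sketch of the loading step, assuming the spark-avro package is on the classpath; the file path and package coordinates are placeholders, not from the original post:

import org.apache.spark.sql.SparkSession

// Requires the external spark-avro package, e.g.
//   --packages org.apache.spark:spark-avro_2.12:<spark-version>
val spark = SparkSession.builder().appName("avro-to-df").getOrCreate()
import spark.implicits._

val initialDF = spark.read
  .format("avro")               // "com.databricks.spark.avro" on Spark < 2.4
  .load("/path/to/input.avro")  // placeholder path

initialDF.printSchema()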

DataFrame Sample Data:

[{"vin":"FU7123456XXXXX","basetime":0,"dtctime":189834,"latitude":36.341587,"longitude":140.327676,"dtcs":[{"fmi":1,"spn":2631,"dtc":"470A01","id":1},{"fmi":0,"spn":0,"dtc":"000000","id":61}],"signals":[{"timestamp":78799,"spn":174,"value":45,"name":"PT"},{"timestamp":12345,"spn":0,"value":10.2,"name":"PT"},{"timestamp":194915,"spn":0,"value":0,"name":"PT"}],"sourceEcu":"MCM","TriggeredDateTime":{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}}]

DataFrame printSchema:

initialDF.printSchema
root
 |-- vin: string (nullable = true)
 |-- basetime: string (nullable = true)
 |-- dtctime: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- dtcs: string (nullable = true)
 |-- signals: string (nullable = true)
 |-- sourceEcu: string (nullable = true)
 |-- dtcTriggeredDateTime: string (nullable = true)

Instead of writing a UDF you can use the built-in get_json_object to parse the JSON row and format_string to produce the desired output.

import org.apache.spark.sql.functions.{get_json_object, format_string}
import spark.implicits._ // for toDF and the $ column syntax; in scope automatically in spark-shell

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

df.select(
  format_string("%s-%s-%s %s:%s:%s",
    get_json_object($"TriggeredDateTime", "$.dateTime.date.year").as("year"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.month").as("month"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.day").as("day"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").as("hour"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").as("min"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.second").as("sec")
  ).as("TriggeredDateTime")
).show(false)

Output:

+-----------------+
|TriggeredDateTime|
+-----------------+
|2019-5-16 4:56:19|
|2018-5-16 4:56:19|
+-----------------+

The function get_json_object converts the JSON string into a JSON object; with the proper selector, e.g. $.dateTime.date.year, we extract each part of the date and pass it as a parameter to format_string in order to build the final date string.
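Note that %s does not zero-pad, which is why the output above reads 2019-5-16 4:56:19 rather than the requested 2019-05-16 04:56:19. If the padded yyyy-MM-dd HH:mm:ss form is required, one option (a sketch building on the same snippet; the helper name part is made up for brevity) is to cast each extracted field to int, since get_json_object returns strings, and use width specifiers:

import org.apache.spark.sql.functions.{get_json_object, format_string}

// Cast each extracted string to int so the %0Nd width specifiers can zero-pad.
def part(path: String) = get_json_object($"TriggeredDateTime", path).cast("int")

df.select(
  format_string("%04d-%02d-%02d %02d:%02d:%02d",
    part("$.dateTime.date.year"),
    part("$.dateTime.date.month"),
    part("$.dateTime.date.day"),
    part("$.dateTime.time.hour"),
    part("$.dateTime.time.minute"),
    part("$.dateTime.time.second")
  ).as("TriggeredDateTime")
).show(false)

// +-------------------+
// |TriggeredDateTime  |
// +-------------------+
// |2019-05-16 04:56:19|
// |2018-05-16 04:56:19|
// +-------------------+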

UPDATE:

For the sake of completeness, instead of calling get_json_object multiple times we can use from_json, providing the schema we already know:

import org.apache.spark.sql.functions.{from_json, format_string}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

val schema = StructType(Seq(
  StructField("dateTime", StructType(Seq(
    StructField("date", StructType(Seq(
      StructField("year", IntegerType, false),
      StructField("month", IntegerType, false),
      StructField("day", IntegerType, false)
    ))),
    StructField("time", StructType(Seq(
      StructField("hour", IntegerType, false),
      StructField("minute", IntegerType, false),
      StructField("second", IntegerType, false),
      StructField("nano", IntegerType, false)
    )))
  ))),
  StructField("offset", StructType(Seq(
    StructField("totalSeconds", IntegerType, false)
  )))
))


df.select(
    from_json($"TriggeredDateTime", schema).as("parsedDateTime")
)
.select(
    format_string("%s-%s-%s %s:%s:%s",
      $"parsedDateTime.dateTime.date.year".as("year"),
      $"parsedDateTime.dateTime.date.month".as("month"),
      $"parsedDateTime.dateTime.date.day".as("day"),
      $"parsedDateTime.dateTime.time.hour".as("hour"),
      $"parsedDateTime.dateTime.time.minute".as("min"),
      $"parsedDateTime.dateTime.time.second".as("sec")
    ).as("TriggeredDateTime")
)
.show(false)

// +-----------------+
// |TriggeredDateTime|
// +-----------------+
// |2019-5-16 4:56:19|
// |2018-5-16 4:56:19|
// +-----------------+
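Since from_json already yields integer fields, the %04d/%02d width-specifier trick shown earlier can be applied here directly to get zero-padded output. And for completeness, since the question explicitly asks about a UDF: it is possible, though the built-in functions above are usually preferable because a UDF is opaque to the Catalyst optimizer. A minimal sketch, assuming the json4s library bundled with Spark; the name parseTriggered is made up for illustration:

import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical UDF: parse the raw JSON string and zero-pad each field.
val parseTriggered = udf { (s: String) =>
  implicit val formats: Formats = DefaultFormats
  val dt = parse(s) \ "dateTime"
  def i(section: String, field: String) = (dt \ section \ field).extract[Int]
  f"${i("date", "year")}%04d-${i("date", "month")}%02d-${i("date", "day")}%02d " +
    f"${i("time", "hour")}%02d:${i("time", "minute")}%02d:${i("time", "second")}%02d"
}

df.select(parseTriggered($"TriggeredDateTime").as("TriggeredDateTime")).show(false)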
