
Parse the String column to get the data in date format using Spark Scala

I have the below column (TriggeredDateTime) in my .avro file, which is of String type, and I need to get the data in yyyy-MM-dd HH:mm:ss format (as shown in the expected output) using Spark-Scala. Please let me know if there is any way to achieve this by writing a UDF, rather than using my approach below. Any help would be greatly appreciated.

 "TriggeredDateTime": {"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}

   expected output
   +-------------------+
   |TriggeredDateTime  |
   +-------------------+
   |2019-05-16 04:56:19|
   +-------------------+

My approach:

I am trying to convert the .avro file to JSON format by applying a schema, and then I can try parsing the JSON to get the desired results.
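For reference, a minimal sketch of how the .avro file might be loaded, assuming Spark 2.4+ (where the avro data source is built in) and a placeholder file path:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ParseTriggeredDateTime").getOrCreate()
import spark.implicits._

// "input.avro" is a placeholder path; on Spark < 2.4 the external
// com.databricks.spark.avro package would be needed instead.
val initialDF = spark.read.format("avro").load("input.avro")
initialDF.printSchema()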

DataFrame sample data:

[{"vin":"FU7123456XXXXX","basetime":0,"dtctime":189834,"latitude":36.341587,"longitude":140.327676,"dtcs":[{"fmi":1,"spn":2631,"dtc":"470A01","id":1},{"fmi":0,"spn":0,"dtc":"000000","id":61}],"signals":[{"timestamp":78799,"spn":174,"value":45,"name":"PT"},{"timestamp":12345,"spn":0,"value":10.2,"name":"PT"},{"timestamp":194915,"spn":0,"value":0,"name":"PT"}],"sourceEcu":"MCM","TriggeredDateTime":{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}}]

DataFrame printSchema output:

initialDF.printSchema
root
 |-- vin: string (nullable = true)
 |-- basetime: string (nullable = true)
 |-- dtctime: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- dtcs: string (nullable = true)
 |-- signals: string (nullable = true)
 |-- sourceEcu: string (nullable = true)
 |-- dtcTriggeredDateTime: string (nullable = true)

Instead of writing a UDF, you can use the built-in get_json_object to parse the JSON rows and format_string to produce the desired output.

import org.apache.spark.sql.functions.{get_json_object, format_string}

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

df.select(
  // get_json_object returns strings, so cast each part to int; the
  // %02d specifiers zero-pad month, day, hour, minute and second so the
  // result matches the expected yyyy-MM-dd HH:mm:ss format.
  format_string("%04d-%02d-%02d %02d:%02d:%02d",
    get_json_object($"TriggeredDateTime", "$.dateTime.date.year").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.month").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.day").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").cast("int"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.second").cast("int")
  ).as("TriggeredDateTime")
).show(false)

Output:

+-------------------+
|TriggeredDateTime  |
+-------------------+
|2019-05-16 04:56:19|
|2018-05-16 04:56:19|
+-------------------+

The get_json_object function parses the JSON string, and each selector, e.g. $.dateTime.date.year, extracts one part of the date. The parts are cast to integers and passed as parameters to format_string, whose zero-padding format specifiers generate the final date in the expected format.
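If a true TimestampType column is needed rather than a formatted string, one further step (my addition, not part of the original answer) is to run the result through the built-in to_timestamp; here formatted stands for the DataFrame produced by the select above:

import org.apache.spark.sql.functions.to_timestamp

// Parse the zero-padded string into a proper TimestampType column.
// The pattern mirrors the format_string output above.
val withTs = formatted.withColumn(
  "TriggeredTs",
  to_timestamp($"TriggeredDateTime", "yyyy-MM-dd HH:mm:ss")
)
withTs.printSchema()  // TriggeredTs: timestamp (nullable = true)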

Update:

For completeness, instead of calling get_json_object multiple times, we can use from_json and supply the schema, which we already know.

import org.apache.spark.sql.functions.{from_json, format_string}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

val schema = StructType(Seq(
  StructField("dateTime", StructType(Seq(
    StructField("date", StructType(Seq(
      StructField("year", IntegerType, false),
      StructField("month", IntegerType, false),
      StructField("day", IntegerType, false)
    ))),
    StructField("time", StructType(Seq(
      StructField("hour", IntegerType, false),
      StructField("minute", IntegerType, false),
      StructField("second", IntegerType, false),
      StructField("nano", IntegerType, false)
    )))
  ))),
  StructField("offset", StructType(Seq(
    StructField("totalSeconds", IntegerType, false)
  )))
))


df.select(
    from_json($"TriggeredDateTime", schema).as("parsedDateTime")
  )
  .select(
    // The struct fields are already integers, so the zero-padding
    // format specifiers can be applied directly, without casts.
    format_string("%04d-%02d-%02d %02d:%02d:%02d",
      $"parsedDateTime.dateTime.date.year",
      $"parsedDateTime.dateTime.date.month",
      $"parsedDateTime.dateTime.date.day",
      $"parsedDateTime.dateTime.time.hour",
      $"parsedDateTime.dateTime.time.minute",
      $"parsedDateTime.dateTime.time.second"
    ).as("TriggeredDateTime")
  )
  .show(false)

// +-------------------+
// |TriggeredDateTime  |
// +-------------------+
// |2019-05-16 04:56:19|
// |2018-05-16 04:56:19|
// +-------------------+
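As a side note, if maintaining the schema by hand becomes tedious, one possible shortcut (an assumption of mine, not something the answer relies on) is to let Spark infer it from the JSON strings themselves, at the cost of an extra pass over the data:

// Infer the schema from the JSON column itself
// (requires import spark.implicits._ for .as[String]).
val inferredSchema = spark.read
  .json(df.select($"TriggeredDateTime").as[String])
  .schema

df.select(from_json($"TriggeredDateTime", inferredSchema).as("parsedDateTime"))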
