
Parse the String column to get the data in date format using Spark Scala

I have the below column (TriggeredDateTime) in my .avro file, which is of type String. I need to get the data in yyyy-MM-dd HH:mm:ss format (as shown in the expected output) using Spark Scala. Please could you let me know whether there is any way to achieve this by writing a UDF, rather than using my approach below. Any help would be much appreciated.

 "TriggeredDateTime": {"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}

  expected output
  +-------------------+
  |TriggeredDateTime  |
  +-------------------+
  |2019-05-16 04:56:19|
  +-------------------+

My Approach:

I'm trying to convert the .avro file to JSON format by applying the schema, and then I can try parsing the JSON to get the required results. For reference, loading the .avro file might look like the sketch below.
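A minimal sketch of the loading step, assuming the spark-avro package is on the classpath; the file path and package coordinates are placeholders, not from the original post:

import org.apache.spark.sql.SparkSession

// Requires the external spark-avro package, e.g.
//   --packages org.apache.spark:spark-avro_2.12:<spark-version>
val spark = SparkSession.builder().appName("avro-to-df").getOrCreate()
import spark.implicits._

val initialDF = spark.read
  .format("avro")               // "com.databricks.spark.avro" on Spark < 2.4
  .load("/path/to/input.avro")  // placeholder path

initialDF.printSchema()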

DataFrame Sample Data:

[{"vin":"FU7123456XXXXX","basetime":0,"dtctime":189834,"latitude":36.341587,"longitude":140.327676,"dtcs":[{"fmi":1,"spn":2631,"dtc":"470A01","id":1},{"fmi":0,"spn":0,"dtc":"000000","id":61}],"signals":[{"timestamp":78799,"spn":174,"value":45,"name":"PT"},{"timestamp":12345,"spn":0,"value":10.2,"name":"PT"},{"timestamp":194915,"spn":0,"value":0,"name":"PT"}],"sourceEcu":"MCM","TriggeredDateTime":{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}}]

DataFrame printSchema:

initialDF.printSchema
root
 |-- vin: string (nullable = true)
 |-- basetime: string (nullable = true)
 |-- dtctime: string (nullable = true)
 |-- latitude: string (nullable = true)
 |-- longitude: string (nullable = true)
 |-- dtcs: string (nullable = true)
 |-- signals: string (nullable = true)
 |-- sourceEcu: string (nullable = true)
 |-- dtcTriggeredDateTime: string (nullable = true)

Instead of writing a UDF you can use the built-in get_json_object to parse the JSON row and format_string to produce the desired output.

import org.apache.spark.sql.functions.{get_json_object, format_string}
import spark.implicits._ // for toDF and the $ column syntax; in scope automatically in spark-shell

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

df.select(
  format_string("%s-%s-%s %s:%s:%s",
    get_json_object($"TriggeredDateTime", "$.dateTime.date.year").as("year"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.month").as("month"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.day").as("day"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").as("hour"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").as("min"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.second").as("sec")
  ).as("TriggeredDateTime")
).show(false)

Output:

+-----------------+
|TriggeredDateTime|
+-----------------+
|2019-5-16 4:56:19|
|2018-5-16 4:56:19|
+-----------------+

The function get_json_object converts the JSON string into a JSON object; with the proper selector, e.g. $.dateTime.date.year, we extract each part of the date and pass it as a parameter to format_string in order to build the final date string.
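Note that %s does not zero-pad, which is why the output above reads 2019-5-16 4:56:19 rather than the requested 2019-05-16 04:56:19. If the padded yyyy-MM-dd HH:mm:ss form is required, one option (a sketch building on the same snippet; the helper name part is made up for brevity) is to cast each extracted field to int, since get_json_object returns strings, and use width specifiers:

import org.apache.spark.sql.functions.{get_json_object, format_string}

// Cast each extracted string to int so the %0Nd width specifiers can zero-pad.
def part(path: String) = get_json_object($"TriggeredDateTime", path).cast("int")

df.select(
  format_string("%04d-%02d-%02d %02d:%02d:%02d",
    part("$.dateTime.date.year"),
    part("$.dateTime.date.month"),
    part("$.dateTime.date.day"),
    part("$.dateTime.time.hour"),
    part("$.dateTime.time.minute"),
    part("$.dateTime.time.second")
  ).as("TriggeredDateTime")
).show(false)

// +-------------------+
// |TriggeredDateTime  |
// +-------------------+
// |2019-05-16 04:56:19|
// |2018-05-16 04:56:19|
// +-------------------+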

UPDATE:

For the sake of completeness, instead of calling get_json_object multiple times we can use from_json, providing the schema we already know:

import org.apache.spark.sql.functions.{from_json, format_string}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

val schema = StructType(Seq(
  StructField("dateTime", StructType(Seq(
    StructField("date", StructType(Seq(
      StructField("year", IntegerType, false),
      StructField("month", IntegerType, false),
      StructField("day", IntegerType, false)
    ))),
    StructField("time", StructType(Seq(
      StructField("hour", IntegerType, false),
      StructField("minute", IntegerType, false),
      StructField("second", IntegerType, false),
      StructField("nano", IntegerType, false)
    )))
  ))),
  StructField("offset", StructType(Seq(
    StructField("totalSeconds", IntegerType, false)
  )))
))


df.select(
    from_json($"TriggeredDateTime", schema).as("parsedDateTime")
)
.select(
    format_string("%s-%s-%s %s:%s:%s",
      $"parsedDateTime.dateTime.date.year".as("year"),
      $"parsedDateTime.dateTime.date.month".as("month"),
      $"parsedDateTime.dateTime.date.day".as("day"),
      $"parsedDateTime.dateTime.time.hour".as("hour"),
      $"parsedDateTime.dateTime.time.minute".as("min"),
      $"parsedDateTime.dateTime.time.second".as("sec")
    ).as("TriggeredDateTime")
)
.show(false)

// +-----------------+
// |TriggeredDateTime|
// +-----------------+
// |2019-5-16 4:56:19|
// |2018-5-16 4:56:19|
// +-----------------+
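Since from_json already yields integer fields, the %04d/%02d width-specifier trick shown earlier can be applied here directly to get zero-padded output. And for completeness, since the question explicitly asks about a UDF: it is possible, though the built-in functions above are usually preferable because a UDF is opaque to the Catalyst optimizer. A minimal sketch, assuming the json4s library bundled with Spark; the name parseTriggered is made up for illustration:

import org.apache.spark.sql.functions.udf
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical UDF: parse the raw JSON string and zero-pad each field.
val parseTriggered = udf { (s: String) =>
  implicit val formats: Formats = DefaultFormats
  val dt = parse(s) \ "dateTime"
  def i(section: String, field: String) = (dt \ section \ field).extract[Int]
  f"${i("date", "year")}%04d-${i("date", "month")}%02d-${i("date", "day")}%02d " +
    f"${i("time", "hour")}%02d:${i("time", "minute")}%02d:${i("time", "second")}%02d"
}

df.select(parseTriggered($"TriggeredDateTime").as("TriggeredDateTime")).show(false)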
