Parse the String column to get the data in date format using Spark Scala
I have the below column (TriggeredDateTime) in my .avro file, which is of type String. I need to get the data in yyyy-MM-dd HH:mm:ss format (as shown in the expected output) using Spark-Scala. Could you please let me know whether there is any way to achieve this by writing a UDF, rather than using my approach below? Any help would be much appreciated.
"TriggeredDateTime": {"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}
expected output
+-------------------+
|TriggeredDateTime  |
+-------------------+
|2019-05-16 04:56:19|
+-------------------+
My Approach:
I'm trying to convert the .avro file to JSON format by applying the schema, and then I can try parsing the JSON to get the required results.
DataFrame Sample Data:
[{"vin":"FU7123456XXXXX","basetime":0,"dtctime":189834,"latitude":36.341587,"longitude":140.327676,"dtcs":[{"fmi":1,"spn":2631,"dtc":"470A01","id":1},{"fmi":0,"spn":0,"dtc":"000000","id":61}],"signals":[{"timestamp":78799,"spn":174,"value":45,"name":"PT"},{"timestamp":12345,"spn":0,"value":10.2,"name":"PT"},{"timestamp":194915,"spn":0,"value":0,"name":"PT"}],"sourceEcu":"MCM","TriggeredDateTime":{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}}]
DataFrame PrintSchema:
initialDF.printSchema
root
|-- vin: string (nullable = true)
|-- basetime: string (nullable = true)
|-- dtctime: string (nullable = true)
|-- latitude: string (nullable = true)
|-- longitude: string (nullable = true)
|-- dtcs: string (nullable = true)
|-- signals: string (nullable = true)
|-- sourceEcu: string (nullable = true)
|-- dtcTriggeredDateTime: string (nullable = true)
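Since the question asks specifically about a UDF: below is a minimal sketch of what such a UDF body could look like, in plain Scala. The regex-based extraction and the names `parseTriggered`/`field` are assumptions for illustration, not code from the post; to use it on a DataFrame column it would be wrapped with `org.apache.spark.sql.functions.udf(parseTriggered _)`.

```scala
// Sketch of a UDF body: pull the numeric fields out of the
// TriggeredDateTime JSON string with a regex and zero-pad them
// into the expected "yyyy-MM-dd HH:mm:ss" layout.
def parseTriggered(json: String): String = {
  // Find the first occurrence of "<name>":<digits> in the string.
  def field(name: String): Int =
    ("\"" + name + "\":(\\d+)").r
      .findFirstMatchIn(json)
      .map(_.group(1).toInt)
      .getOrElse(0)

  f"${field("year")}%04d-${field("month")}%02d-${field("day")}%02d " +
    f"${field("hour")}%02d:${field("minute")}%02d:${field("second")}%02d"
}
```

A regex is fragile against reordered or differently named keys; the `get_json_object`/`from_json` approaches below are more robust and avoid the serialization cost of a UDF.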
Instead of writing a UDF you can use the built-in get_json_object to parse the JSON row, and format_string to produce the desired output.
import org.apache.spark.sql.functions.{get_json_object, format_string}
val df = Seq(
("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")
df.select(
format_string("%s-%s-%s %s:%s:%s",
get_json_object($"TriggeredDateTime", "$.dateTime.date.year").as("year"),
get_json_object($"TriggeredDateTime", "$.dateTime.date.month").as("month"),
get_json_object($"TriggeredDateTime", "$.dateTime.date.day").as("day"),
get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").as("hour"),
get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").as("min"),
get_json_object($"TriggeredDateTime", "$.dateTime.time.second").as("sec")
).as("TriggeredDateTime")
).show(false)
Output:
+-----------------+
|TriggeredDateTime|
+-----------------+
|2019-5-16 4:56:19|
|2018-5-16 4:56:19|
+-----------------+
The function get_json_object converts the JSON string into a JSON object; with the proper selector we then extract each part of the date, e.g. $.dateTime.date.year, which we pass as a parameter to the format_string function in order to generate the final date.
UPDATE:
For the sake of completeness, instead of calling get_json_object multiple times we can use from_json, providing the schema we already know:
import org.apache.spark.sql.functions.{from_json, format_string}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
val df = Seq(
("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")
val schema = StructType(Seq(
  StructField("dateTime", StructType(Seq(
    StructField("date", StructType(Seq(
      StructField("year", IntegerType, false),
      StructField("month", IntegerType, false),
      StructField("day", IntegerType, false)
    ))),
    StructField("time", StructType(Seq(
      StructField("hour", IntegerType, false),
      StructField("minute", IntegerType, false),
      StructField("second", IntegerType, false),
      StructField("nano", IntegerType, false)
    )))
  ))),
  StructField("offset", StructType(Seq(
    StructField("totalSeconds", IntegerType, false)
  )))
))
df.select(
from_json($"TriggeredDateTime", schema).as("parsedDateTime")
)
.select(
format_string("%s-%s-%s %s:%s:%s",
$"parsedDateTime.dateTime.date.year".as("year"),
$"parsedDateTime.dateTime.date.month".as("month"),
$"parsedDateTime.dateTime.date.day".as("day"),
$"parsedDateTime.dateTime.time.hour".as("hour"),
$"parsedDateTime.dateTime.time.minute".as("min"),
$"parsedDateTime.dateTime.time.second".as("sec")
).as("TriggeredDateTime")
)
.show(false)
// +-----------------+
// |TriggeredDateTime|
// +-----------------+
// |2019-5-16 4:56:19|
// |2018-5-16 4:56:19|
// +-----------------+
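Note that the output above is not zero-padded ("2019-5-16 4:56:19"), while the expected output in the question is "2019-05-16 04:56:19". Since from_json yields IntegerType fields, one likely fix (a sketch, not verified against the original data) is to switch to zero-padded integer specifiers, i.e. format_string("%04d-%02d-%02d %02d:%02d:%02d", ...). The per-row formatting this relies on, shown in plain Scala with a hypothetical helper name:

```scala
// format_string delegates to java.util.Formatter semantics, so
// integer inputs with zero-padded width specifiers produce the
// asker's expected "yyyy-MM-dd HH:mm:ss" layout.
def padFormat(year: Int, month: Int, day: Int,
              hour: Int, minute: Int, second: Int): String =
  "%04d-%02d-%02d %02d:%02d:%02d".format(year, month, day, hour, minute, second)

println(padFormat(2019, 5, 16, 4, 56, 19)) // 2019-05-16 04:56:19
```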