Parse the String column to get the data in date format using Spark Scala
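I have the following column (TriggeredDateTime) in my .avro file. It is of type String, and I need to get the data in yyyy-MM-dd HH:mm:ss format (as shown in the expected output) using Spark Scala. Please let me know if there is any way to achieve this by writing a UDF, rather than using my approach below. Any help would be much appreciated.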
"TriggeredDateTime": {"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}
Expected output:
+-------------------+
|TriggeredDateTime  |
+-------------------+
|2019-05-16 04:56:19|
+-------------------+
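Note that the expected format zero-pads the month, day, and hour. As a point of reference, here is a minimal sketch in plain Scala (no Spark) of how that target string could be produced with java.time, assuming the six numeric fields have already been pulled out of the JSON. The names TriggeredDateTimeFormat and render are hypothetical, and the nano and offset fields are ignored.

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatter

object TriggeredDateTimeFormat {
  // Target pattern taken from the question: yyyy-MM-dd HH:mm:ss
  private val formatter = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

  // Hypothetical helper: builds the formatted string from the numeric JSON fields.
  // The "nano" and "offset" fields of the source record are deliberately ignored.
  def render(year: Int, month: Int, day: Int, hour: Int, minute: Int, second: Int): String =
    LocalDateTime.of(year, month, day, hour, minute, second).format(formatter)
}
```

For the sample record, TriggeredDateTimeFormat.render(2019, 5, 16, 4, 56, 19) yields 2019-05-16 04:56:19. A function like this could be wrapped in a Spark UDF, although built-in functions are usually preferable where they suffice.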
My approach:
I am trying to convert the .avro file to JSON format by applying a schema, so that I can then parse the JSON to get the required result.
Sample DataFrame data:
[{"vin":"FU7123456XXXXX","basetime":0,"dtctime":189834,"latitude":36.341587,"longitude":140.327676,"dtcs":[{"fmi":1,"spn":2631,"dtc":"470A01","id":1},{"fmi":0,"spn":0,"dtc":"000000","id":61}],"signals":[{"timestamp":78799,"spn":174,"value":45,"name":"PT"},{"timestamp":12345,"spn":0,"value":10.2,"name":"PT"},{"timestamp":194915,"spn":0,"value":0,"name":"PT"}],"sourceEcu":"MCM","TriggeredDateTime":{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}}]
DataFrame schema:
initialDF.printSchema
root
|-- vin: string (nullable = true)
|-- basetime: string (nullable = true)
|-- dtctime: string (nullable = true)
|-- latitude: string (nullable = true)
|-- longitude: string (nullable = true)
|-- dtcs: string (nullable = true)
|-- signals: string (nullable = true)
|-- sourceEcu: string (nullable = true)
|-- dtcTriggeredDateTime: string (nullable = true)
Instead of writing a UDF, you can use the built-in get_json_object to parse the JSON row and format_string to build the desired output.
import org.apache.spark.sql.functions.{get_json_object, format_string}
import spark.implicits._ // provides toDF and the $"..." syntax (already imported in spark-shell)

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

// Extract each date/time part by JSON path and join the parts into one string
df.select(
  format_string("%s-%s-%s %s:%s:%s",
    get_json_object($"TriggeredDateTime", "$.dateTime.date.year").as("year"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.month").as("month"),
    get_json_object($"TriggeredDateTime", "$.dateTime.date.day").as("day"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.hour").as("hour"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.minute").as("min"),
    get_json_object($"TriggeredDateTime", "$.dateTime.time.second").as("sec")
  ).as("TriggeredDateTime")
).show(false)
Output:
+-----------------+
|TriggeredDateTime|
+-----------------+
|2019-5-16 4:56:19|
|2018-5-16 4:56:19|
+-----------------+
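The function get_json_object extracts the value at a given JSON path (for example $.dateTime.date.year) from the JSON string and returns it as a string. Each extracted part is then passed as an argument to format_string, which assembles the final date string. Because the parts are plain strings formatted with %s, single-digit values are not zero-padded, as the output above shows.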
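To get the zero-padded form the question asks for (2019-05-16 04:56:19), one option is to cast each extracted part to an integer and use %02d-style placeholders. The following is a sketch under that assumption, not part of the original answer. The object name ZeroPaddedTriggeredDateTime and the helper part are hypothetical, and the code assumes a SparkSession is passed in.

```scala
import org.apache.spark.sql.{Column, SparkSession}
import org.apache.spark.sql.functions.{col, format_string, get_json_object}

object ZeroPaddedTriggeredDateTime {
  // Hypothetical helper: extracts one JSON field as a string and casts it to int,
  // so that the numeric placeholders (%02d, %04d) can pad it with zeros.
  private def part(path: String): Column =
    get_json_object(col("TriggeredDateTime"), path).cast("int")

  def run(spark: SparkSession): Array[String] = {
    import spark.implicits._

    val df = Seq(
      """{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""
    ).toDF("TriggeredDateTime")

    df.select(
      format_string("%04d-%02d-%02d %02d:%02d:%02d",
        part("$.dateTime.date.year"),
        part("$.dateTime.date.month"),
        part("$.dateTime.date.day"),
        part("$.dateTime.time.hour"),
        part("$.dateTime.time.minute"),
        part("$.dateTime.time.second")
      ).as("TriggeredDateTime")
    ).as[String].collect()
  }
}
```

In spark-shell, ZeroPaddedTriggeredDateTime.run(spark) should return a single string, 2019-05-16 04:56:19, for the sample row.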
Update:
For completeness, instead of calling get_json_object multiple times, we can use from_json and supply the schema, which we already know:
import org.apache.spark.sql.functions.{from_json, format_string}
import org.apache.spark.sql.types.{StructType, StructField, IntegerType}
import spark.implicits._ // provides toDF and the $"..." syntax (already imported in spark-shell)

val df = Seq(
  ("""{"dateTime":{"date":{"year":2019,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}"""),
  ("""{"dateTime":{"date":{"year":2018,"month":5,"day":16},"time":{"hour":4,"minute":56,"second":19,"nano":480389000}},"offset":{"totalSeconds":0}}""")
).toDF("TriggeredDateTime")

// Schema mirroring the JSON structure of TriggeredDateTime
val schema =
  StructType(Seq(
    StructField("dateTime", StructType(Seq(
      StructField("date", StructType(Seq(
        StructField("year", IntegerType, false),
        StructField("month", IntegerType, false),
        StructField("day", IntegerType, false)
      ))),
      StructField("time", StructType(Seq(
        StructField("hour", IntegerType, false),
        StructField("minute", IntegerType, false),
        StructField("second", IntegerType, false),
        StructField("nano", IntegerType, false)
      )))
    ))),
    StructField("offset", StructType(Seq(
      StructField("totalSeconds", IntegerType, false)
    )))
  ))

// Parse the JSON string once, then read the parts from the resulting struct
df.select(
    from_json($"TriggeredDateTime", schema).as("parsedDateTime")
  )
  .select(
    format_string("%s-%s-%s %s:%s:%s",
      $"parsedDateTime.dateTime.date.year".as("year"),
      $"parsedDateTime.dateTime.date.month".as("month"),
      $"parsedDateTime.dateTime.date.day".as("day"),
      $"parsedDateTime.dateTime.time.hour".as("hour"),
      $"parsedDateTime.dateTime.time.minute".as("min"),
      $"parsedDateTime.dateTime.time.second".as("sec")
    ).as("TriggeredDateTime")
  )
  .show(false)
// +-----------------+
// |TriggeredDateTime|
// +-----------------+
// |2019-5-16 4:56:19|
// |2018-5-16 4:56:19|
// +-----------------+
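If a real timestamp column is wanted rather than a formatted string, the parsed integer fields can be zero-padded with format_string and then converted with to_timestamp. This is a sketch under assumptions, not part of the original answer: the object name TriggeredDateTimeAsTimestamp, the method withTimestamp, and the output column TriggeredTs are hypothetical. The offset and nano fields are ignored, and the string is interpreted in the Spark session time zone. The schema lists only the fields that are actually read.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, format_string, from_json, to_timestamp}
import org.apache.spark.sql.types.{IntegerType, StructType}

object TriggeredDateTimeAsTimestamp {
  // Reduced schema: JSON fields that are not listed (nano, offset) are simply skipped.
  private val schema = new StructType()
    .add("dateTime", new StructType()
      .add("date", new StructType()
        .add("year", IntegerType)
        .add("month", IntegerType)
        .add("day", IntegerType))
      .add("time", new StructType()
        .add("hour", IntegerType)
        .add("minute", IntegerType)
        .add("second", IntegerType)))

  // Adds a TimestampType column "TriggeredTs" derived from the JSON string
  // column "TriggeredDateTime" (both column names are assumptions of this sketch).
  def withTimestamp(df: DataFrame): DataFrame = {
    val parsed = df.withColumn("p", from_json(col("TriggeredDateTime"), schema))
    val text = format_string("%04d-%02d-%02d %02d:%02d:%02d",
      col("p.dateTime.date.year"),
      col("p.dateTime.date.month"),
      col("p.dateTime.date.day"),
      col("p.dateTime.time.hour"),
      col("p.dateTime.time.minute"),
      col("p.dateTime.time.second"))
    parsed
      .withColumn("TriggeredTs", to_timestamp(text, "yyyy-MM-dd HH:mm:ss"))
      .drop("p")
  }
}
```

Calling TriggeredDateTimeAsTimestamp.withTimestamp(df) on the DataFrame above keeps the original string column and adds the timestamp next to it, which can then be used with date functions or written out in any format via date_format.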