Convert a string variable in nested JSON to datetime using Spark Scala

I have a nested JSON dataframe in Spark that looks like this:

root
 |-- data: struct (nullable = true)
 |    |-- average: long (nullable = true)
 |    |-- sum: long (nullable = true)
 |    |-- time: string (nullable = true)
 |-- password: string (nullable = true)
 |-- url: string (nullable = true)
 |-- username: string (nullable = true)

I need to convert the time variable under the data struct to timestamp data type. Following is the code I tried, but it did not give me the result I wanted.

val jsonStr = """{
  "url": "imap.yahoo.com",
  "username": "myusername",
  "password": "mypassword",
  "data": {
    "time": "2017-1-29 0-54-32",
    "average": 234,
    "sum": 123
  }
}"""


import play.api.libs.json._
import org.apache.spark.sql.functions.{udf, unix_timestamp}
import sqlContext.implicits._

// Wraps the parsed timestamp in a one-field case class; this is what
// ends up replacing the entire data struct
case class Convert(time: java.sql.Timestamp)
val makeTimeStamp = udf((time: java.sql.Timestamp) => Convert(time))

val json: JsValue = Json.parse(jsonStr)

val rdd = sc.parallelize(jsonStr :: Nil)
val df = sqlContext.read.json(rdd)
df.printSchema()
val dfRes = df.withColumn("data",
  makeTimeStamp(unix_timestamp(df("data.time"), "yyyy-MM-dd HH-mm-ss").cast("timestamp")))
dfRes.printSchema()

Result of my code:

root
 |-- data: struct (nullable = true)
 |    |-- time: timestamp (nullable = true)
 |-- password: string (nullable = true)
 |-- url: string (nullable = true)
 |-- username: string (nullable = true)

My code is actually removing the other elements inside the data struct (average and sum) instead of just casting the time string to timestamp. For basic data management operations on JSON dataframes, do we need to write a UDF every time we need some functionality, or is there a library available for JSON data management? I am currently using the Play framework for working with JSON objects in Spark. Thanks in advance.

You can try this:

val jsonStr = """{
  "url": "imap.yahoo.com",
  "username": "myusername",
  "password": "mypassword",
  "data": {
    "time": "2017-1-29 0-54-32",
    "average": 234,
    "sum": 123
  }
}"""


import org.apache.spark.sql.functions.{udf, unix_timestamp}
import sqlContext.implicits._

// Carry average and sum through the UDF so they are not dropped
case class Convert(time: java.sql.Timestamp, average: Long, sum: Long)
val makeTimeStamp = udf((time: java.sql.Timestamp, average: Long, sum: Long) =>
  Convert(time, average, sum))

val rdd = sc.parallelize(jsonStr :: Nil)
val df = sqlContext.read.json(rdd)
df.printSchema()
val dfRes = df.withColumn("data", makeTimeStamp(
  unix_timestamp(df("data.time"), "yyyy-MM-dd HH-mm-ss").cast("timestamp"),
  df("data.average"),
  df("data.sum")))
dfRes.printSchema()

This will give the result:

root
 |-- data: struct (nullable = true)
 |    |-- time: timestamp (nullable = true)
 |    |-- average: long (nullable = false)
 |    |-- sum: long (nullable = false)
 |-- password: string (nullable = true)
 |-- url: string (nullable = true)
 |-- username: string (nullable = true)

The only things changed are the Convert case class and the makeTimeStamp UDF.
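As an aside, a UDF is not strictly required for this: Spark's built-in struct function can rebuild the nested column, converting only the time field and passing the others through unchanged. A minimal sketch against the same df (assuming Spark 1.5+, where aliased columns become struct field names; dfRes2 is just an illustrative name):

import org.apache.spark.sql.functions.{struct, unix_timestamp}

// Rebuild data from its own fields; only time changes type
val dfRes2 = df.withColumn("data", struct(
  unix_timestamp(df("data.time"), "yyyy-MM-dd HH-mm-ss").cast("timestamp").as("time"),
  df("data.average").as("average"),
  df("data.sum").as("sum")))
dfRes2.printSchema()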

Assuming you can specify the Spark schema upfront, the automatic string-to-timestamp type coercion should take care of the conversions.

import org.apache.spark.sql.types._

// Fields left out of the schema (password and data.average here) are
// simply not loaded
val dschema = (new StructType)
  .add("url", StringType)
  .add("username", StringType)
  .add("data", (new StructType)
    .add("sum", LongType)
    .add("time", TimestampType))
val df = spark.read.schema(dschema).json("/your/json/on/hdfs")
df.printSchema
df.show
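One caveat: the sample's time string ("2017-1-29 0-54-32") is not in Spark's default timestamp format, so the coercion above may come back null for that field. On Spark 2.1+ the JSON reader accepts a timestampFormat option; a sketch, assuming that version:

// Supply the non-standard pattern so "2017-1-29 0-54-32" parses
// (assumes Spark 2.1+, where the JSON source honours timestampFormat)
val df = spark.read
  .schema(dschema)
  .option("timestampFormat", "yyyy-M-d H-m-s")
  .json("/your/json/on/hdfs")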

This article outlines a few more techniques for dealing with bad data; worth a read for your use case.
