
How to convert a DataFrame of strings to a DataFrame with a defined schema

I have a DataFrame of strings, where each string is a JSON element. I want to convert it to a DataFrame with a proper schema.

{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}
{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}
{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}

Here is my input schema, from input.printSchema:

input: org.apache.spark.sql.DataFrame = [value: string]
root
 |-- value: string (nullable = true)
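For reference, an input DataFrame of this shape can be built from the raw JSON strings (a minimal sketch using one of the sample rows above; assumes an active SparkSession named `spark`):

```scala
import spark.implicits._

// One-column DataFrame of raw JSON strings, matching the schema printed above
val input = Seq(
  """{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}"""
).toDF("value")

input.printSchema  // |-- value: string (nullable = true)
```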

What I want is something like this:


root
 |-- StartTime: long (nullable = true)
 |-- StatusCode: integer (nullable = true)
 |-- HTTPMethod: string (nullable = true)
 |-- HTTPUserAgent: string (nullable = true)

I tried creating a struct schema and building a DataFrame from it, but it throws an ArrayIndexOutOfBoundsException.

spark.createDataFrame(input,simpleSchema).show

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 116.0 failed 4 times, most recent failure: Lost task 0.3 in stage 116.0 (TID 17471, ip-10-0-62-29.ec2.internal, executor 1030): java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true]).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true]), 0, Channel), StringType), true, false) AS Channel#947
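`spark.createDataFrame(input, simpleSchema)` fails because each Row still carries a single string column while the schema declares four fields; attaching a schema does not parse the JSON. If the fields are known up front, the schema can be declared explicitly and applied with `from_json` (a sketch; the types are chosen to fit the sample values, with `StartTime` as long since the values overflow integer):

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._

// Explicit schema matching the JSON fields in the sample rows
val simpleSchema = StructType(Seq(
  StructField("StartTime", LongType, nullable = true),
  StructField("StatusCode", IntegerType, nullable = true),
  StructField("HTTPMethod", StringType, nullable = true),
  StructField("HTTPUserAgent", StringType, nullable = true)
))

// Parse each JSON string into a struct, then flatten it into top-level columns
val parsed = input
  .withColumn("jsonData", from_json(col("value"), simpleSchema))
  .select("jsonData.*")
```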
The schema can also be inferred from the JSON strings themselves with spark.read.json, then applied with from_json:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> df.printSchema
root
 |-- value: string (nullable = true)

scala> df.show(false)
+--------------------------------------------------------------------------------------------------------------------+
|value                                                                                                                |
+--------------------------------------------------------------------------------------------------------------------+
|{"StartTime":1649424816686069,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"} |
|{"StartTime":164981846249877,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}  |
|{"StartTime":16498172424241095,"StatusCode":200,"HTTPMethod":"GET","HTTPUserAgent":"Jakarta Commons-HttpClient/3.1"}|
+--------------------------------------------------------------------------------------------------------------------+


scala> val sch = spark.read.json(df.select("value").as[String].distinct).schema
sch: org.apache.spark.sql.types.StructType = StructType(StructField(HTTPMethod,StringType,true), StructField(HTTPUserAgent,StringType,true), StructField(StartTime,LongType,true), StructField(StatusCode,LongType,true))

scala> val df1 = df.withColumn("jsonData", from_json(col("value"), sch, Map.empty[String, String])).select(col("jsonData.*"))
df1: org.apache.spark.sql.DataFrame = [HTTPMethod: string, HTTPUserAgent: string ... 2 more fields]

scala> df1.show(false)
+----------+------------------------------+-----------------+----------+
|HTTPMethod|HTTPUserAgent                 |StartTime        |StatusCode|
+----------+------------------------------+-----------------+----------+
|GET       |Jakarta Commons-HttpClient/3.1|1649424816686069 |200       |
|GET       |Jakarta Commons-HttpClient/3.1|164981846249877  |200       |
|GET       |Jakarta Commons-HttpClient/3.1|16498172424241095|200       |
+----------+------------------------------+-----------------+----------+

scala> df1.printSchema
root
 |-- HTTPMethod: string (nullable = true)
 |-- HTTPUserAgent: string (nullable = true)
 |-- StartTime: long (nullable = true)
 |-- StatusCode: long (nullable = true)
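On Spark 2.4+, `schema_of_json` is another option: it derives the schema from a single sample string, avoiding the extra full read that `spark.read.json` performs (a sketch; assumes the DataFrame is non-empty and that all rows share the same fields):

```scala
import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}
import spark.implicits._

// Take one sample JSON string and let Spark infer the schema from it alone
val sample = df.select("value").as[String].head
val df2 = df
  .withColumn("jsonData", from_json(col("value"), schema_of_json(lit(sample))))
  .select("jsonData.*")
```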

