
Spark createDataFrame failing with ArrayOutOfBoundsException

I'm pretty new to Spark and am having a problem converting an RDD to a DataFrame. What I'm trying to do is take a log file, convert each line to JSON using an existing jar (it returns a JSON string), and then turn the resulting JSON into a DataFrame. Here is what I have so far:

val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonRows = serverLog.mapPartitions(partition => {
  val txfm = new JsonParser // jar that parses a log line to a JSON string
  partition.map(line => {
    Row(txfm.parseLine(line))
  })
})

When I run a take(2) on this I get something like:

[{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]
[{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}]

My problem comes here: I create a schema and try to create the DataFrame.

val schema = StructType(Array(
  StructField("pwh",StringType,true),
  StructField("sVe",StringType,true),...))

val jsonDf = sqlSession.createDataFrame(jsonRows, schema)

And the returned error is

java.lang.RuntimeException: Error while encoding: java.lang.ArrayIndexOutOfBoundsException: 1
if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true) AS _pwh#0
+- if (assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt) null else staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, validateexternaltype(getexternalrowfield(assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object), 0, pwh), StringType), true)
:- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object).isNullAt
:  :- assertnotnull(input[0, org.apache.spark.sql.Row, true], top level row object)
:  :  +- input[0, org.apache.spark.sql.Row, true]
:  +- 0
:- null

Can someone tell me what I'm doing wrong here? Most of the SO answers I've found say I can use either createDataFrame or toDF(), but I've had no luck with either. I also tried converting the RDD to a JavaRDD, but that also did not work. Appreciate any insight you can give.

Your defined schema describes an RDD of flat records like:

{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}
{"pwh":"800","sVe":"10.0","psh":"1000","udt":"desktop"}

but Row(txfm.parseLine(line)) produces rows with a single string column holding the whole JSON document (that's what the brackets in your take(2) output show), so the encoder fails with ArrayIndexOutOfBoundsException: 1 the moment it tries to read a second field.

If you can change your RDD so that each record looks like

{"logs": [{"pwh":"600","sVe":"10.0","psh":"667","udt":"mobile"}]}

and use this schema:

val schema = StructType(Seq(
  StructField("logs", ArrayType(StructType(Seq(
    StructField("pwh", StringType, true),
    StructField("sVe", StringType, true), ...))))
))

sqlSession.read.schema(schema).json(jsonRows) // jsonRows must be the raw JSON strings (RDD[String] / Dataset[String]), not Rows
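Since parseLine already returns one flat JSON object per line, a simpler route is to skip the Row wrapper entirely and let read.json apply your original flat schema. A sketch, assuming Spark 2.2+ (where read.json accepts a Dataset[String]) and the field names from your take(2) output:

import org.apache.spark.sql.types._
import sqlSession.implicits._ // for .toDS() on RDD[String]

val serverLog = sc.textFile("/Users/Downloads/file1.log")
val jsonStrings = serverLog.mapPartitions { partition =>
  val txfm = new JsonParser
  partition.map(line => txfm.parseLine(line)) // plain JSON strings, no Row wrapper
}

val schema = StructType(Seq(
  StructField("pwh", StringType, true),
  StructField("sVe", StringType, true),
  StructField("psh", StringType, true),
  StructField("udt", StringType, true)))

val jsonDf = sqlSession.read.schema(schema).json(jsonStrings.toDS())
jsonDf.show()

On Spark 2.0/2.1 you can pass the RDD directly, sqlSession.read.schema(schema).json(jsonStrings), though that overload was deprecated later.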
