
Spark Streaming - custom receiver and DataFrame schema inference

Consider the following code snippet in the receiver:

// Receive one message from the subscriber socket (blocking call)
val incomingMessage = subscriberSocket.recv(0)
// Split the comma-separated payload; stringMessages(0) carries the table name
val stringMessages = new String(incomingMessage).stripLineEnd.split(',')
// Prepend the table name to the fields from index 2 onward; every value stays a String
store(Row.fromSeq(Array(stringMessages(0)) ++ stringMessages.drop(2)))

At the receiver, I would prefer not to convert each column to the actual column type of the table (the table is indicated by stringMessages(0)).

In the main section of the code, when I do

val df = sqlContext.createDataFrame(eachGDNRdd,getSchemaAsStructField)
println(df.collect().length)

I get the following error:

java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double
        at scala.runtime.BoxesRunTime.unboxToDouble(BoxesRunTime.java:119)
        at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow$class.getDouble(rows.scala:44)

Now, the schema consists of both String and Int fields. I have cross-verified that the fields match by type. However, it looks like the Spark DataFrame is not inferring the types.
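For reference, here is a minimal standalone sketch (assuming an active SparkContext sc and a Spark 1.x-style sqlContext, as in the code above) that reproduces the behaviour: createDataFrame does not cast Row values to the declared schema types, so a String stored where the schema declares a Double only blows up when the field is accessed at execution time.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val schema = StructType(Seq(
  StructField("name", StringType),
  StructField("value", DoubleType)
))

// The receiver stored every field as a String, so "42.0" is a String here
val badRdd = sc.parallelize(Seq(Row("a", "42.0")))
sqlContext.createDataFrame(badRdd, schema).collect() // ClassCastException, as above

// Converting explicitly before building the Row avoids the error
val goodRdd = sc.parallelize(Seq(Row("a", "42.0".toDouble)))
sqlContext.createDataFrame(goodRdd, schema).collect() // OK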


Questions

1. Shouldn't Spark infer the types in the schema at run time (unless there is a contradiction)?
2. Since the table is dynamic, the schema varies based on the first element of each row (which contains the table name). Is there a simple, suggested way to modify the schema on the fly?

Or am I missing something obvious?

I'm new to Spark and you didn't say which version you're running, but in v2.1.0 schema inference is disabled by default, for exactly the reason you mention: if the record structure is inconsistent, Spark can't reliably infer the schema. You can enable schema inference by setting spark.sql.streaming.schemaInference to true, but I think you're better off specifying the schema yourself.
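A hedged sketch of that suggestion follows (the table name "orders", its schema, and the parseByType/toTypedRow helpers are illustrative placeholders, not part of the original code): keep one StructType per table name, and cast each string field to its declared type before building the Row.

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// One schema per table name the receiver can emit
val schemas: Map[String, StructType] = Map(
  "orders" -> StructType(Seq(
    StructField("table", StringType),
    StructField("qty", IntegerType),
    StructField("price", DoubleType)
  ))
  // ... add an entry for each table
)

// Convert a raw string to the type its StructField declares
def parseByType(raw: String, field: StructField): Any = field.dataType match {
  case IntegerType => raw.toInt
  case DoubleType  => raw.toDouble
  case _           => raw
}

// Build a typed Row by pairing each raw field with its StructField
def toTypedRow(fields: Array[String], schema: StructType): Row =
  Row.fromSeq(fields.zip(schema.fields).map { case (raw, f) => parseByType(raw, f) })

With something along these lines, each micro-batch RDD can be converted with sqlContext.createDataFrame(typedRdd, schemas(tableName)), where the schema is looked up from the first element of the row, and the ClassCastException goes away because the Row values already match the declared types.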
