
Infer schema from RDD to DataFrame in Spark Scala

This question follows on from Spark - creating schema programmatically with different data types.

I am trying to infer the schema for an RDD and convert it to a DataFrame. Below is my code:

 import org.apache.spark.sql.Row
 import org.apache.spark.sql.types._

 def inferType(field: String) = field.split(":")(1) match {
    case "Integer" => IntegerType
    case "Double" => DoubleType
    case "String" => StringType
    case "Timestamp" => TimestampType
    case "Date" => DateType
    case "Long" => LongType
    case _ => StringType
 }


val header = "c1:String|c2:String|c3:Double|c4:Integer|c5:String|c6:Timestamp|c7:Long|c8:Date"

val df1 = Seq(("a|b|44.44|5|c|2018-01-01 01:00:00|456|2018-01-01")).toDF("data")
val rdd1 = df1.rdd.map(x => Row(x.getString(0).split("\\|"): _*))

val schema = StructType(header.split("\\|").map(column => StructField(column.split(":")(0), inferType(column), true)))
val df = spark.createDataFrame(rdd1, schema)
df.show()

When I call show, it throws the error below. I have to perform this operation on larger-scale data and am having trouble finding the right solution. Can anybody please help me find a solution for this, or any other way I can achieve it?

java.lang.RuntimeException: Error while encoding: java.lang.RuntimeException: java.lang.String is not a valid external type for schema of int

Thanks in advance

Short answer: a Row of raw strings cannot be used with a schema of non-string types; createDataFrame does not cast or parse the values for you.

What you are trying to do is parse a single delimited string into typed SQL columns. The difference from the linked example is that it loads from CSV, where Spark parses the values into the schema types for you; here every value stays a plain string, so you must convert each one yourself. A working version can be achieved like this:

// imports, inferType and the SparkSession are as in the question
import org.apache.spark.rdd.RDD

val header = "c1:String|c2:String|c3:Double|c4:Integer"

// Create `Row` from `Seq`
val row = Row.fromSeq(Seq("a|b|44.44|12|"))

// Create `RDD` from `Row`
val rdd: RDD[Row] = spark.sparkContext
  .makeRDD(List(row))
  .map { row =>
    row.getString(0).split("\\|") match {
      case Array(col1, col2, col3, col4) =>
        Row(col1, col2, col3.toDouble, col4.toInt)
    }
  }
val stt: StructType = StructType(
  header
    .split("\\|")
    .map(column => StructField(column.split(":")(0), inferType(column), true))
)

val dataFrame = spark.createDataFrame(rdd, stt)
dataFrame.show()
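To go beyond the hard-coded `Array(col1, col2, col3, col4)` pattern match, the per-column conversion can be driven by the type names already present in the header. This is a minimal, Spark-free sketch; the `castTo` helper and variable names are illustrative, not part of the original answer, and the result of the final `map` is what you would pass to `Row.fromSeq` inside the RDD transformation:

```scala
// Hypothetical helper: convert a raw string to the JVM type named in the header.
def castTo(value: String, typeName: String): Any = typeName match {
  case "Integer" => value.toInt
  case "Double"  => value.toDouble
  case "Long"    => value.toLong
  case _         => value // fall back to String
}

val header = "c1:String|c2:String|c3:Double|c4:Integer"
// Extract just the type names: Array("String", "String", "Double", "Integer")
val typeNames = header.split("\\|").map(_.split(":")(1))

val raw = "a|b|44.44|12".split("\\|")
// Pair each raw value with its declared type and convert it.
val typed = raw.zip(typeNames).map { case (v, t) => castTo(v, t) }
// typed now holds ("a", "b", 44.44, 12) and can feed Row.fromSeq(typed)
```

Because the conversion is driven by the same header string that builds the StructType, the two stay in sync however many columns the data has.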

The reason for building the Row from Scala values of the matching types (Double, Int, ...) rather than raw strings is that createDataFrame requires each value's external type to be compatible with the corresponding schema field.
Note I skipped the date- and time-related fields; date conversions are tricky. You can check my other answer on how to use formatted dates and timestamps here.
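For completeness, when the input happens to use the JDBC literal formats (as the question's sample data does), the standard `java.sql` factory methods convert the skipped date and timestamp columns directly; Spark's TimestampType and DateType accept `java.sql.Timestamp` and `java.sql.Date` as external types. This assumes the `yyyy-MM-dd HH:mm:ss` and `yyyy-MM-dd` formats; anything else needs an explicit formatter first:

```scala
import java.sql.{Date, Timestamp}

// JDBC-style literals convert without a custom formatter.
val ts = Timestamp.valueOf("2018-01-01 01:00:00") // expects yyyy-[m]m-[d]d hh:mm:ss
val dt = Date.valueOf("2018-01-01")               // expects yyyy-[m]m-[d]d
```

These values can then be placed in the Row alongside the other converted columns.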
