
scala.MatchError during Spark 2.0.2 DataFrame union

I'm attempting to merge 2 DataFrames, one with old data and one with new data, using the union function. This used to work until I tried to dynamically add a new field to the old DataFrame because my schema is evolving.

This means that my old data will be missing a field and the new data will have it. In order for the union to work, I'm adding the field using the evolveSchema function below.

This resulted in the output/exception I pasted below the code, including my debug prints.

The column reordering and the step that makes every field nullable are attempts to fix this issue by making the two DataFrames as identical as possible, but the error persists. The schema prints show that both are seemingly identical after these manipulations.

Any help to further debug this would be appreciated.

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.{StructField, StructType}
import org.apache.spark.sql.{DataFrame, SQLContext}

object Merger {

  def apply(sqlContext: SQLContext, oldDataSet: Option[DataFrame], newEnrichments: Option[DataFrame]): Option[DataFrame] = {

    (oldDataSet, newEnrichments) match {
      case (None, None) => None
      case (None, _) => newEnrichments
      case (Some(existing), None) => Some(existing)
      case (Some(existing), Some(news)) => Some {

        val evolvedOldDataSet = evolveSchema(existing)

        println("EVOLVED OLD SCHEMA FIELD NAMES:" + evolvedOldDataSet.schema.fieldNames.mkString(","))
        println("NEW SCHEMA FIELD NAMES:" + news.schema.fieldNames.mkString(","))

        println("EVOLVED OLD SCHEMA FIELD TYPES:" + evolvedOldDataSet.schema.fields.map(_.dataType).mkString(","))
        println("NEW SCHEMA FIELD TYPES:" + news.schema.fields.map(_.dataType).mkString(","))

        println("OLD SCHEMA")
        existing.printSchema()
        println("PRINT EVOLVED OLD SCHEMA")
        evolvedOldDataSet.printSchema()
        println("PRINT NEW SCHEMA")
        news.printSchema()

        val nullableEvolvedOldDataSet = setNullableTrue(evolvedOldDataSet)
        val nullableNews = setNullableTrue(news)

        println("NULLABLE EVOLVED OLD")
        nullableEvolvedOldDataSet.printSchema()
        println("NULLABLE NEW")
        nullableNews.printSchema()

        val unionData = nullableEvolvedOldDataSet.union(nullableNews)

        val result = unionData.sort(
          unionData("timestamp").desc
        ).dropDuplicates(
          Seq("id")
        )
        result.cache()
      }
    }
  }

  def GENRE_FIELD : String = "station_genre"

  // Handle missing fields in old data
  def evolveSchema(oldDataSet: DataFrame): DataFrame = {
    if (!oldDataSet.schema.fieldNames.contains(GENRE_FIELD)) {

      val columnAdded = oldDataSet.withColumn(GENRE_FIELD, lit("N/A"))

      // Columns should be in the same order for union
      val columnNamesInOrder = Seq("id", "station_id", "station_name", "station_timezone", "station_genre", "publisher_id", "publisher_name", "group_id", "group_name", "timestamp")
      val reorderedColumns = columnAdded.select(columnNamesInOrder.head, columnNamesInOrder.tail: _*)

      reorderedColumns
    }
    else
      oldDataSet
  }

  def setNullableTrue(df: DataFrame) : DataFrame = {
    // get schema
    val schema = df.schema
    // create new schema with all fields nullable
    val newSchema = StructType(schema.map {
      case StructField(columnName, dataType, _, metaData) => StructField( columnName, dataType, nullable = true, metaData)
    })
    // apply new schema
    df.sqlContext.createDataFrame( df.rdd, newSchema )
  }

}

EVOLVED OLD SCHEMA FIELD NAMES: id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp

NEW SCHEMA FIELD NAMES: id,station_id,station_name,station_timezone,station_genre,publisher_id,publisher_name,group_id,group_name,timestamp

EVOLVED OLD SCHEMA FIELD TYPES: StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType

NEW SCHEMA FIELD TYPES: StringType,LongType,StringType,StringType,StringType,LongType,StringType,LongType,StringType,LongType

OLD SCHEMA
root
 |-- id: string (nullable = true)
 |-- station_id: long (nullable = true)
 |-- station_name: string (nullable = true)
 |-- station_timezone: string (nullable = true)
 |-- publisher_id: long (nullable = true)
 |-- publisher_name: string (nullable = true)
 |-- group_id: long (nullable = true)
 |-- group_name: string (nullable = true)
 |-- timestamp: long (nullable = true)

PRINT EVOLVED OLD SCHEMA
root
 |-- id: string (nullable = true)
 |-- station_id: long (nullable = true)
 |-- station_name: string (nullable = true)
 |-- station_timezone: string (nullable = true)
 |-- station_genre: string (nullable = false)
 |-- publisher_id: long (nullable = true)
 |-- publisher_name: string (nullable = true)
 |-- group_id: long (nullable = true)
 |-- group_name: string (nullable = true)
 |-- timestamp: long (nullable = true)

PRINT NEW SCHEMA
root
 |-- id: string (nullable = true)
 |-- station_id: long (nullable = true)
 |-- station_name: string (nullable = true)
 |-- station_timezone: string (nullable = true)
 |-- station_genre: string (nullable = true)
 |-- publisher_id: long (nullable = true)
 |-- publisher_name: string (nullable = true)
 |-- group_id: long (nullable = true)
 |-- group_name: string (nullable = true)
 |-- timestamp: long (nullable = true)

NULLABLE EVOLVED OLD
root
 |-- id: string (nullable = true)
 |-- station_id: long (nullable = true)
 |-- station_name: string (nullable = true)
 |-- station_timezone: string (nullable = true)
 |-- station_genre: string (nullable = true)
 |-- publisher_id: long (nullable = true)
 |-- publisher_name: string (nullable = true)
 |-- group_id: long (nullable = true)
 |-- group_name: string (nullable = true)
 |-- timestamp: long (nullable = true)

NULLABLE NEW
root
 |-- id: string (nullable = true)
 |-- station_id: long (nullable = true)
 |-- station_name: string (nullable = true)
 |-- station_timezone: string (nullable = true)
 |-- station_genre: string (nullable = true)
 |-- publisher_id: long (nullable = true)
 |-- publisher_name: string (nullable = true)
 |-- group_id: long (nullable = true)
 |-- group_name: string (nullable = true)
 |-- timestamp: long (nullable = true)

2017-01-18 15:59:32 ERROR org.apache.spark.internal.Logging$class Executor:91 - Exception in task 1.0 in stage 2.0 (TID 4)
scala.MatchError: false (of class java.lang.Boolean)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
    at ...
    com.companystuff.meta.uploader.Merger$.apply(Merger.scala:49)
    ...
Caused by: scala.MatchError: false (of class java.lang.Boolean)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:296)
    ...

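The cause is the ordering of the columns in the actual data, even though the schemas look the same: union matches columns by position, not by name. So select all the required columns, in the same order, on both DataFrames, and then do the union.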

Something like this:

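import org.apache.spark.sql.functions.col
val columns: Seq[String] = ....  // all required column names, in one fixed order
// select(...) takes Column varargs, so map each name to a Column first
val df = oldDf.select(columns.map(col): _*).union(newDf.select(columns.map(col): _*))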
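For the case in the question, where the old data lacks a column entirely, the same idea can be extended a little. The following is only a minimal sketch under assumptions: AlignedUnion, alignTo, unionAligned and the defaults map are hypothetical names that do not appear in the question, and the sketch assumes that every column missing from a DataFrame has an entry in defaults. It projects both sides onto one explicit column list, filling any missing column from the supplied default, and only then calls union.

```scala
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions.col

// Hypothetical helper object; none of these names come from the question.
object AlignedUnion {

  // Project `df` onto `columns`, in exactly that order. A column that `df`
  // lacks is filled from `defaults`; a missing default fails immediately
  // with a descriptive error instead of surfacing later inside a Spark task.
  def alignTo(df: DataFrame, columns: Seq[String], defaults: Map[String, Column]): DataFrame = {
    val present = df.columns.toSet
    val projected: Seq[Column] = columns.map { name =>
      if (present.contains(name)) col(name)
      else defaults.getOrElse(
        name,
        throw new IllegalArgumentException(s"no default for missing column '$name'")
      ).as(name)
    }
    df.select(projected: _*)
  }

  // union pairs columns by position, so both sides are aligned to the
  // same column list before the union is performed.
  def unionAligned(
      oldDf: DataFrame,
      newDf: DataFrame,
      columns: Seq[String],
      defaults: Map[String, Column]): DataFrame =
    alignTo(oldDf, columns, defaults).union(alignTo(newDf, columns, defaults))
}
```

With the column list from the question, a call might look like AlignedUnion.unionAligned(existing, news, columnNamesInOrder, Map("station_genre" -> lit("N/A"))), where lit comes from org.apache.spark.sql.functions. Selecting by name on both sides means that neither DataFrame relies on its physical column order matching the other's.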

Hope it helps you
