Some(null) to StringType nullable scala.MatchError

I have an RDD[(Seq[String], Seq[String])] with some null values in the data. Converted to a DataFrame, the RDD looks like this:

+----------+----------+
|      col1|      col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+

Following is the sample code:

val rdd = sc.parallelize(Seq((Seq("111","aaa"),Seq("xx",null))))
val df = rdd.toDF("col1","col2")
val keys = Array("col1","col2")
val values = df.flatMap {
    case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
    case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))

val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))

val transposedDf = sqlContext.createDataFrame(transposed, schema)

transposedDf.show()

It runs fine up to the point where I create transposedDf, but as soon as I call show it throws the following error:

scala.MatchError: null
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)

If there are no null values in the RDD, the code works fine. I do not understand why it fails when I have null values, because I am specifying the schema as StringType with nullable = true. Am I doing something wrong? I am using Spark 1.6.1 and Scala 2.10.

Pattern matching is attempted in the order the cases appear in the source, so this line:

case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)

which places no restrictions on the values of t1 and t2, always matches first, so the case that checks for null is never reached.

Effectively, put the null check first and it should work.
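
A minimal sketch of the reordering this answer describes, reusing the names from the question (note that Seq[String] is unchecked at runtime because of type erasure):

import org.apache.spark.sql.Row

val values = df.flatMap {
    case Row(_, null) => None                // the null field is now caught before the general case
    case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
}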

The issue is that the first pattern matches whether you have a null or not. After all, t2: Seq[String] could theoretically be null. While it's true that you can solve this immediately by simply making the null pattern appear first, I feel it is imperative to use the facilities of the Scala language to get rid of null altogether and avoid more bad runtime surprises.

So you could do something like this:

def foo(s: Seq[String]) = if (s.contains(null)) None else Some(s)
// or you could do fancy things with filter/filterNot

df.map {
    case Row(first: Seq[String], second: Seq[String]) => (foo(first), foo(second))  // df.map hands you Rows, not tuples
}

This will provide the Some/None tuples you seem to want, but I would look into flattening out those Nones as well.
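
For instance, one hedged way to do that flattening, reusing the hypothetical foo from above: collect keeps only the tuples where both sides are defined and unwraps them in one pass.

val flattened = df.map {
    case Row(first: Seq[String], second: Seq[String]) => (foo(first), foo(second))
}.collect {
    case (Some(f), Some(s)) => (f, s)        // tuples containing a None are dropped here
}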

I think you will need to encode the null values as a blank or special String before performing these operations. Also keep in mind that Spark executes lazily, so from the line val values = df.flatMap onward, everything runs only when show() is called.
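
As a sketch of that encoding idea applied to the question's rdd, using the empty string as the sentinel (the sentinel choice is an assumption):

val sanitized = rdd.map { case (a, b) =>
    (a.map(x => if (x == null) "" else x),   // replace null elements with the sentinel
     b.map(x => if (x == null) "" else x))
}
val safeDf = sanitized.toDF("col1", "col2")  // the Catalyst converters never see a raw null now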
