
Some(null) to StringType nullable scala.MatchError

I have an RDD[(Seq[String], Seq[String])] with some null values in the data. The RDD converted to a DataFrame looks like this:

+----------+----------+
|      col1|      col2|
+----------+----------+
|[111, aaa]|[xx, null]|
+----------+----------+

Following is the sample code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{DataTypes, StructField, StructType}

val rdd = sc.parallelize(Seq((Seq("111","aaa"),Seq("xx",null))))
val df = rdd.toDF("col1","col2")
val keys = Array("col1","col2")
val values = df.flatMap {
    case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
    case Row(_, null) => None
}
val transposed = values.map(someFunc(keys))

val schema = StructType(keys.map(name => StructField(name, DataTypes.StringType, nullable = true)))

val transposedDf = sqlContext.createDataFrame(transposed, schema)

transposedDf.show()

It runs fine until the point where I create transposedDf, but as soon as I call show it throws the following error:

scala.MatchError: null
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:97)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
        at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
        at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)

If there are no null values in the RDD, the code works fine. I do not understand why it fails when there are null values, because I am specifying a schema of StringType with nullable set to true. Am I doing something wrong? I am using Spark 1.6.1 and Scala 2.10.

Pattern matching is performed linearly, in the order the cases appear in the source, so this line:

case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)

which places no restrictions on the values of t1 and t2, will match no matter what, even with null values present.

Effectively, put the null check first and it should work.
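A minimal sketch of that reordering, assuming the rest of the question's code is unchanged (note this only guards against the second column itself being null, not against null elements inside the sequence):

val values = df.flatMap {
    case Row(_, null) => None  // null check first, so it can actually match
    case Row(t1: Seq[String], t2: Seq[String]) => Some((t1 zip t2).toMap)
}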

The issue is that the first pattern matches whether you find null or not. After all, t2: Seq[String] could theoretically be null. While it's true that you can solve this immediately by simply making the null pattern appear first, I feel it is imperative to use the facilities of the Scala language to get rid of null altogether and avoid more bad runtime surprises.

So you could do something like this:

def foo(s: Seq[String]) = if (s.contains(null)) None else Some(s)
// or you could do fancy things with filter/filterNot

// map over the original RDD of tuples, not the DataFrame
rdd.map {
    case (first, second) => (foo(first), foo(second))
}

This will give you the Some / None tuples you seem to want, but I would look into flattening out those None s as well.
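One way to do that flattening, shown as a sketch reusing the foo helper above: a for-comprehension over the two Options yields a pair only when both sides are null-free, and flatMap discards the rest.

// Sketch: rows containing null in either sequence are dropped entirely.
val cleaned = rdd.flatMap { case (first, second) =>
    for (f <- foo(first); s <- foo(second)) yield (f, s)
}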

I think you will need to encode null values as a blank or special String before performing these operations. Also keep in mind that Spark executes lazily, so from the line "val values = df.flatMap" onward, everything is executed only when show() is called.
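A minimal sketch of that encoding, assuming the empty string is an acceptable sentinel (denull is a hypothetical helper, not part of the original code):

// Replace null elements with "" so every value Catalyst converts is a real String.
def denull(s: Seq[String]): Seq[String] = s.map(x => if (x == null) "" else x)

val cleanedRdd = rdd.map { case (t1, t2) => (denull(t1), denull(t2)) }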
