I have a Spark (version 1.3.1) application in which I am trying to convert a JavaBean RDD, JavaRDD<Message>, into a DataFrame. The bean has many fields with different data types (Integer, String, List, Map, Double). But when I execute this code:
messages.foreachRDD(new Function2<JavaRDD<Message>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Message> arg0, Time arg1) throws Exception {
        SQLContext sqlContext = SparkConnection.getSqlContext();
        DataFrame df = sqlContext.createDataFrame(arg0, Message.class);
        df.registerTempTable("messages");
        return null;
    }
});
I get this error:
15/06/12 17:27:40 INFO JobScheduler: Starting job streaming job 1434110260000 ms.0 from job set of time 1434110260000 ms
15/06/12 17:27:40 ERROR JobScheduler: Error running job streaming job 1434110260000 ms.1
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465)
If Message has fields like List, and the error message points to a List match error, then that is the issue. If you look at the source code, you can see that List is not handled in the match.
But besides digging around in the source code, this is also very clearly stated in the documentation under the Java tab:
Currently, Spark SQL does not support JavaBeans that contain nested or contain complex types such as Lists or Arrays.
You may want to switch to Scala as it seems to be supported there:
Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table.
So the solution is either to use Scala or to remove the List from your JavaBean.
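One way to remove the List without losing its contents is to flatten it into a bean-friendly type before calling createDataFrame. The sketch below (plain Java, no Spark required to run it) uses a hypothetical FlatMessage bean that stores the former List<String> field as a single comma-joined String, which the JavaBean schema inference in Spark 1.3.1 can handle; the names FlatMessage, from, and tagList are illustrative, not part of any API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical flattened bean: the original List<String> field is stored as a
// single comma-joined String, so createDataFrame(rdd, FlatMessage.class) only
// sees String properties that the JavaBean schema inference supports.
class FlatMessage implements java.io.Serializable {
    private String id;
    private String tags; // was List<String> in the original Message bean

    public FlatMessage() {} // JavaBeans need a no-arg constructor

    // Flatten a list of tags into the single String field.
    public static FlatMessage from(String id, List<String> tagList) {
        FlatMessage m = new FlatMessage();
        m.id = id;
        m.tags = String.join(",", tagList);
        return m;
    }

    // Recover the original list on the consumer side, e.g. after collect().
    public List<String> tagList() {
        return Arrays.asList(tags.split(","));
    }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTags() { return tags; }
    public void setTags(String tags) { this.tags = tags; }
}
```

You would map your JavaRDD<Message> to a JavaRDD<FlatMessage> before calling createDataFrame. This only works cleanly when the list elements are simple values without the delimiter character; for Map or nested types you would need a similar ad-hoc encoding.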
As a last resort, you can take a look at SQLUserDefinedType to define how that List should be persisted; maybe it's possible to hack it together.
I resolved this problem by updating my Spark version from 1.3.1 to 1.4.0. Now it works fine.