I have a Spark (version 1.3.1) application in which I am trying to convert a JavaBean RDD, JavaRDD<Message>, into a DataFrame. The bean has many fields with different data types (Integer, String, List, Map, Double). But when I execute this code:
messages.foreachRDD(new Function2<JavaRDD<Message>, Time, Void>() {
    @Override
    public Void call(JavaRDD<Message> arg0, Time arg1) throws Exception {
        SQLContext sqlContext = SparkConnection.getSqlContext();
        DataFrame df = sqlContext.createDataFrame(arg0, Message.class);
        df.registerTempTable("messages");
        return null;
    }
});
I get this error:
15/06/12 17:27:40 INFO JobScheduler: Starting job streaming job 1434110260000 ms.0 from job set of time 1434110260000 ms
15/06/12 17:27:40 ERROR JobScheduler: Error running job streaming job 1434110260000 ms.1
scala.MatchError: interface java.util.List (of class java.lang.Class)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1193)
at org.apache.spark.sql.SQLContext$$anonfun$getSchema$1.apply(SQLContext.scala:1192)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
at org.apache.spark.sql.SQLContext.getSchema(SQLContext.scala:1192)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:437)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:465)
If Message has fields like List, and the error message points to a List match error, then that is the issue. If you look at the source code, you can see that List is not handled in the match.
But besides digging around in the source code, this is also very clearly stated in the documentation under the Java tab:
Currently, Spark SQL does not support JavaBeans that contain nested or contain complex types such as Lists or Arrays.
You may want to switch to Scala as it seems to be supported there:
Case classes can also be nested or contain complex types such as Sequences or Arrays. This RDD can be implicitly converted to a DataFrame and then be registered as a table.
So the solution is either to use Scala or to remove the List from your JavaBean.
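One way to remove the List without losing its contents is to flatten it into a bean-friendly type before calling createDataFrame. The sketch below (plain Java, no Spark required to run it) uses a hypothetical FlatMessage bean that stores the former List<String> field as a single comma-joined String, which the JavaBean schema inference in Spark 1.3.1 can handle; the names FlatMessage, from, and tagList are illustrative, not part of any API:

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical flattened bean: the original List<String> field is stored as a
// single comma-joined String, so createDataFrame(rdd, FlatMessage.class) only
// sees String properties that the JavaBean schema inference supports.
class FlatMessage implements java.io.Serializable {
    private String id;
    private String tags; // was List<String> in the original Message bean

    public FlatMessage() {} // JavaBeans need a no-arg constructor

    // Flatten a list of tags into the single String field.
    public static FlatMessage from(String id, List<String> tagList) {
        FlatMessage m = new FlatMessage();
        m.id = id;
        m.tags = String.join(",", tagList);
        return m;
    }

    // Recover the original list on the consumer side, e.g. after collect().
    public List<String> tagList() {
        return Arrays.asList(tags.split(","));
    }

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }
    public String getTags() { return tags; }
    public void setTags(String tags) { this.tags = tags; }
}
```

You would map your JavaRDD<Message> to a JavaRDD<FlatMessage> before calling createDataFrame. This only works cleanly when the list elements are simple values without the delimiter character; for Map or nested types you would need a similar ad-hoc encoding.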
As a last resort, you can take a look at SQLUserDefinedType to define how that List should be persisted; maybe it's possible to hack it together.
I resolved this problem by updating my Spark version from 1.3.1 to 1.4.0. Now it works fine.