How to create Spark DataFrame from RDD[Row] when Row contains Map[Map]
This question is a continuation of this other one, where the user who gave the valid answer asked me to open a new question to explain my further doubts.
What I am trying to do is to generate a DataFrame from an RDD[Objects], where my objects have primitive types but also complex types. The previous question explained how to parse a complex Map type. What I tried next was to extrapolate the given solution to parse a Map[Map], so that in the DataFrame it is converted into an Array(Map).

Below is the code I have written so far:
//I get an Object from Hbase here
val objectRDD : RDD[HbaseRecord] = ...

//I convert the RDD[HbaseRecord] into RDD[Row]
val rowRDD : RDD[Row] = objectRDD.map(
    hbaseRecord => {
        val uuid : String = hbaseRecord.uuid
        val timestamp : String = hbaseRecord.timestamp

        val name = Row(hbaseRecord.nameMap.firstName.getOrElse(""),
                       hbaseRecord.nameMap.middleName.getOrElse(""),
                       hbaseRecord.nameMap.lastName.getOrElse(""))

        val contactsMap = hbaseRecord.contactsMap

        val homeContactMap = contactsMap.get("HOME")
        val homeContact = Row(homeContactMap.contactType,
                              homeContactMap.areaCode,
                              homeContactMap.number)

        val workContactMap = contactsMap.get("WORK")
        val workContact = Row(workContactMap.contactType,
                              workContactMap.areaCode,
                              workContactMap.number)

        val contacts = Row(homeContact, workContact)

        Row(uuid, timestamp, name, contacts)
    }
)
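(The HbaseRecord class itself is not shown in the question. Purely for context, a minimal sketch of the shape implied by the accesses above; every name and type here is an assumption, not the real class:)

// Hypothetical shapes only -- inferred from the field accesses above:
case class Name(firstName: Option[String],
                middleName: Option[String],
                lastName: Option[String])
case class Contact(contactType: String, areaCode: String, number: String)
case class HbaseRecord(uuid: String,
                       timestamp: String,
                       nameMap: Name,
                       // a java.util.Map would explain .get("HOME") returning
                       // the value directly rather than a Scala Option
                       contactsMap: java.util.Map[String, Contact])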
//Here I define the schema
val schema = new StructType()
    .add("uuid", StringType)
    .add("timestamp", StringType)
    .add("name", new StructType()
        .add("firstName", StringType)
        .add("middleName", StringType)
        .add("lastName", StringType))
    .add("contacts", new StructType(
        Array(
            StructField("contactType", StringType),
            StructField("areaCode", StringType),
            StructField("number", StringType)
        )))
//Now I try to create a DataFrame using the RDD[Row] and the schema
val dataFrame = sqlContext.createDataFrame(rowRDD, schema)
But I am getting the following error:
19/03/18 12:09:53 ERROR executor.Executor: Exception in task 0.0 in stage 1.0 (TID 8)
scala.MatchError: [HOME,05,12345678] (of class org.apache.spark.sql.catalyst.expressions.GenericRow)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:295)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StringConverter$.toCatalystImpl(CatalystTypeConverters.scala:294)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:260)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$StructConverter.toCatalystImpl(CatalystTypeConverters.scala:250)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:102)
    at org.apache.spark.sql.catalyst.CatalystTypeConverters$$anonfun$createToCatalystConverter$2.apply(CatalystTypeConverters.scala:401)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at org.apache.spark.sql.SQLContext$$anonfun$6.apply(SQLContext.scala:492)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
    at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
    at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
    at scala.collection.AbstractIterator.to(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
    at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
    at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
    at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$5.apply(SparkPlan.scala:212)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
I tried as well to generate the contacts element as an array:
val contacts = Array(homeContact,workContact)
But then I get the following error instead:
scala.MatchError: [Lorg.apache.spark.sql.Row;@726c6aec (of class [Lorg.apache.spark.sql.Row;)
Can anyone spot the problem?
Let's simplify your situation to your array of contacts. That's where the problem is. You are trying to use this schema:
val schema = new StructType()
    .add("contacts", new StructType(
        Array(
            StructField("contactType", StringType),
            StructField("areaCode", StringType),
            StructField("number", StringType)
        )))
to store a list of contacts, which is a struct type. Yet, this schema cannot contain a list, just one contact. We can verify it with:
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema).printSchema
root
|-- contacts: struct (nullable = true)
| |-- contactType: string (nullable = true)
| |-- areaCode: string (nullable = true)
| |-- number: string (nullable = true)
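We can also reproduce the original MatchError against this schema. A minimal sketch, assuming the same sc and spark as in the verification above:

// "contacts" below is a Row holding two Rows, but the schema declares a
// single struct of three strings; when converting field "contactType",
// Spark's converter meets Row("HOME", ...) where it expects a String,
// which is exactly the scala.MatchError from the question.
val bad = Row(Row(Row("HOME", "05", "12345678"),
                  Row("WORK", "06", "87654321")))
spark.createDataFrame(sc.parallelize(Seq(bad)), schema).show()
// => scala.MatchError: [HOME,05,12345678] (of class ...GenericRow)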
Indeed, the Array you have in your code is just meant to contain the fields of your "contacts" struct type. To achieve what you want, a type exists: ArrayType. This yields a slightly different result:
val schema_ok = new StructType()
    .add("contacts", ArrayType(new StructType(Array(
        StructField("contactType", StringType),
        StructField("areaCode", StringType),
        StructField("number", StringType)))))
spark.createDataFrame(sc.parallelize(Seq[Row]()), schema_ok).printSchema
root
|-- contacts: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- contactType: string (nullable = true)
| | |-- areaCode: string (nullable = true)
| | |-- number: string (nullable = true)
and it works:
val row = Row(Array(
    Row("type", "code", "number"),
    Row("type2", "code2", "number2")))
spark.createDataFrame(sc.parallelize(Seq(row)), schema_ok).show(false)
+-------------------------------------------+
|contacts |
+-------------------------------------------+
|[[type,code,number], [type2,code2,number2]]|
+-------------------------------------------+
So if you update the schema with this version of "contacts", you just have to replace val contacts = Row(homeContact, workContact) with val contacts = Array(homeContact, workContact) and it should work.
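Putting the two changes together, the full schema from your question would then look like this (a sketch; schema_fixed is just an illustrative name, the rest is your schema unchanged):

val schema_fixed = new StructType()
    .add("uuid", StringType)
    .add("timestamp", StringType)
    .add("name", new StructType()
        .add("firstName", StringType)
        .add("middleName", StringType)
        .add("lastName", StringType))
    // "contacts" is now an array of structs, matching
    // val contacts = Array(homeContact, workContact)
    .add("contacts", ArrayType(new StructType(Array(
        StructField("contactType", StringType),
        StructField("areaCode", StringType),
        StructField("number", StringType)))))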
NB: if you want to label your contacts (with HOME or WORK), there exists a MapType type as well.
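For illustration, a minimal sketch of what a MapType-based "contacts" column could look like, keyed by HOME/WORK as in your contactsMap (schema_map and row_map are hypothetical names):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val contactStruct = new StructType(Array(
    StructField("contactType", StringType),
    StructField("areaCode", StringType),
    StructField("number", StringType)))

val schema_map = new StructType()
    .add("contacts", MapType(StringType, contactStruct))

// Each map value is a Row matching contactStruct:
val row_map = Row(Map(
    "HOME" -> Row("HOME", "05", "12345678"),
    "WORK" -> Row("WORK", "06", "87654321")))

spark.createDataFrame(sc.parallelize(Seq(row_map)), schema_map).printSchema
// root
//  |-- contacts: map (nullable = true)
//  |    |-- key: string
//  |    |-- value: struct (valueContainsNull = true)
//  |    |    |-- contactType: string (nullable = true)
//  |    |    |-- areaCode: string (nullable = true)
//  |    |    |-- number: string (nullable = true)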