Schema error using hiveContext.createDataFrame from an RDD [scala spark 2.4]

Question

Trying to run: val outputDF = hiveContext.createDataFrame(myRDD, schema)

Getting this error: Caused by: java.lang.RuntimeException: scala.Tuple2 is not a valid external type for schema of struct<col1name:string,col2name:string>

myRDD.take(5).foreach(println)

[string number,[Lscala.Tuple2;@163601a5]
[1234567890,[Lscala.Tuple2;@6fa7a81c]

Data of the RDD:

RDD[Row]: [string number, [(string key, string value)]]
Row(string, Array(Tuple(String, String)))

where the Tuple2 contains data like this:

(string key, string value)

schema:
root
 |-- col1name: string (nullable = true)
 |-- col2name: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- col3name: string (nullable = true)
 |    |    |-- col4name: string (nullable = true)

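import org.apache.spark.sql.types._

// The same schema written as Scala code
val schema = StructType(Seq(
    StructField("col1name", StringType, true),
    StructField("col2name", ArrayType(
        StructType(Seq(
            StructField("col3name", StringType, true),
            StructField("col4name", StringType, true)
            )),
        true
        ),
    true
    )
))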

This code ran on Spark 1.6 without problems. In Spark 2.4, it appears that Tuple2 doesn't count as a struct type? In that case, what should it be changed to?

I'm assuming the easiest solution would be to adjust the schema to suit the data.

Let me know if more details are needed.

Answer 1

The fix is to change the tuple that held the two strings into a Row holding the two strings instead.

So for the provided schema, the incoming data structure was

Row(string, Array(Tuple(String, String)))

This was changed to

Row(string, Array(Row(String, String)))

in order to continue using the same schema.
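
A minimal sketch of that conversion, since the error reports that Tuple2 is not a valid external type for the struct. The object name TupleToRowExample, the helper tuplesToRows, and the sample keys and values are made up for illustration, and the sketch uses a local SparkSession instead of the hiveContext from the question (an assumption; the createDataFrame(rdd, schema) call has the same shape on both).

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object TupleToRowExample {
  // Same shape as the schema in the question.
  val schema: StructType = StructType(Seq(
    StructField("col1name", StringType, nullable = true),
    StructField("col2name", ArrayType(
      StructType(Seq(
        StructField("col3name", StringType, nullable = true),
        StructField("col4name", StringType, nullable = true)
      )),
      containsNull = true
    ), nullable = true)
  ))

  // Hypothetical helper: turns Row(String, Array[(String, String)])
  // into Row(String, Seq[Row]) so every struct element is a Row, not a Tuple2.
  def tuplesToRows(row: Row): Row = {
    val pairs = row.getAs[Array[(String, String)]](1)
    Row(row.getString(0), pairs.map { case (k, v) => Row(k, v) }.toSeq)
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session stands in for the question's hiveContext.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("tuple-to-row")
      .getOrCreate()

    // Made-up sample data in the shape described in the question.
    val myRDD: RDD[Row] = spark.sparkContext.parallelize(Seq(
      Row("1234567890", Array(("key1", "value1"), ("key2", "value2")))
    ))

    // Convert the nested tuples first, then apply the unchanged schema.
    val outputDF: DataFrame = spark.createDataFrame(myRDD.map(tuplesToRows), schema)
    outputDF.printSchema()
    outputDF.show(truncate = false)

    spark.stop()
  }
}
```

Mapping over the RDD before calling createDataFrame keeps the schema untouched, which matches the approach in this answer. If the tuples are built earlier in the pipeline, producing Row(k, v) there directly would avoid the extra map.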

Answer 1 (accepted) 2021-02-03 22:37:37
