Spark SQL table max column count

I am trying to create a Spark SQL table in a Scala program by creating an RDD with 200+ columns. Compilation (sbt compile) fails with a java.lang.StackOverflowError exception when I create my schema as:

StructField("RT", StringType,nullable = true) ::
StructField("SERIALNO", StringType,nullable = true) ::
StructField("SPORDER", StringType,nullable = true) ::
// ... remaining 200+ columns

I can't paste the stack trace as it is more than 1.5k lines.

When I reduce the column count to around 100-120, compilation succeeds. Compilation also succeeds when I create the schema from a schema string (splitting the string and then mapping over it), as in the first example under the heading "Programmatically Specifying the Schema" in https://spark.apache.org/docs/1.3.0/sql-programming-guide.html.
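
For reference, that approach looks roughly like this (a sketch based on the linked guide, with the column list abbreviated):

import org.apache.spark.sql.types.{StringType, StructField, StructType}

val schemaString = "RT SERIALNO SPORDER" // ... remaining column names
val schemaFromString = StructType(
  schemaString.split(" ").map(fieldName => StructField(fieldName, StringType, nullable = true)))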

What is the problem with manually specifying the schema that results in this exception?

The basic issue here is that you are doing a list concatenation at each step for each StructField. The :: operator is actually a member of List, not StructField. While the code reads:

val fields = field1 :: field2 :: field3 :: Nil

This is equivalent to:

val fields = field1 :: (field2 :: (field3 :: Nil))

or even

val fields = Nil.::(field1).::(field2).::(field3)

So, on evaluation, the JVM needs to recursively process the calls to the :: method, increasing the depth of the stack in proportion to the number of items in the list. The reason that splitting a string of field names and mapping over it works is that it iterates through the split field names rather than building one deeply nested expression.

This is not a Spark issue. You can reproduce the same stack overflow error with a long series of List concatenations of any type in the Scala REPL once you get into the hundreds of items. Just use one of the other approaches to creating your list of StructFields, one that doesn't cause a stack overflow.
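
A rough way to see this for yourself (the exact threshold depends on the JVM stack size, so the count below is illustrative, not exact):

// Print a single expression with 5,000 chained :: operators; pasting the
// output into the REPL can trigger the same StackOverflowError.
println((1 to 5000).mkString("val xs = ", " :: ", " :: Nil"))

// The equivalent list built iteratively compiles and runs without issue.
val xs = (1 to 5000).toList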

For example, passing all of the fields to List(...) in a single call, like this, will work just fine:

val structure = StructType(
  List(
    StructField("RT", StringType, nullable = true),
    StructField("SERIALNO", StringType, nullable = true),
    StructField("SPORDER", StringType, nullable = true),
    // Other Fields
    StructField("LASTFIELD", StringType, nullable = true)
  )
)
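
Alternatively (a sketch; the field-name list here is hypothetical), you can keep the column names in a collection and map over it, which also avoids a deeply nested expression in the source:

// Hypothetical list of the 200+ column names.
val fieldNames = Seq("RT", "SERIALNO", "SPORDER", /* ... */ "LASTFIELD")

// Mapping builds the fields iteratively, so no :: chain appears in source.
val structureFromNames = StructType(
  fieldNames.map(name => StructField(name, StringType, nullable = true)))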
