df.schema does not return the full StructType of the dataframe schema

Question

I have a dataframe with complex nested columns. I want to cast every integer type to double, using the Scala code below:

def castIntToDouble(schema: StructType): Seq[Column] = {
  schema.fields.map { f =>
    f.dataType match {
      case IntegerType => col(f.name).cast(DoubleType)
      case StructType(_) =>
        col(f.name).cast(
          f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
        )
      case dt: ArrayType =>
        dt.elementType match {
          case IntegerType => col(f.name).cast(ArrayType(DoubleType))
          case StructType(_) =>
            col(f.name).cast(
              f.dataType.simpleString.replace(s":${IntegerType.simpleString}",s":${DoubleType.simpleString}")
            )
          case _ => col(f.name)
        }
      case _ => col(f.name)
    }
  }
}

df = df.select(castIntToDouble(df.schema):_*)

When I run this code, it throws an error:

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '.' expecting {'SELECT', 'FROM', 'ADD'}(line 1, pos 1420)

== SQL ==
array<struct<createdAt:timestamp,sender:struct<firstName:string,lastName:string,phoneNumber:string,role:string,userId:double>,senderId:double,status:string,text:string,type:string,... 2 more fields>,issuer:struct<firstName:string,lastName:string,phoneNumber:string,role:string,userId:double>,type:string>>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------^^^

I think the "type:string,... 2 more fields" part is causing the problem. I have run this code on many dataframes and it works very well, but not on this one.

I would like to know how to get an expanded version of the StructType from df.schema, so that "n more fields" does not appear.

Thanks in advance for your help.

Answer 1

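I think the problem is that "... 2 more fields" appears in f.dataType.simpleString, because you are using the String representation of the type rather than the type itself. Even if you get it to work, the compiler will not help you catch mistakes: if it fails, it fails at execution time.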
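To illustrate the point about string representations, here is a minimal sketch that compares simpleString with catalogString on a wide struct. The object name and the 30-field struct are hypothetical, and the sketch assumes Spark's default display settings, under which simpleString abbreviates wide structs while catalogString keeps every field.

```scala
import org.apache.spark.sql.types._

object SimpleStringVsCatalogString {
  // Hypothetical wide struct: 30 integer fields named f0 .. f29.
  val wide: StructType =
    StructType((0 until 30).map(i => StructField(s"f$i", IntegerType)))

  def main(args: Array[String]): Unit = {
    // Display-oriented string; wide structs are abbreviated with "... N more fields".
    println(wide.simpleString)
    // Full type string that lists every field.
    println(wide.catalogString)
  }
}
```

If the string-based cast has to stay, replacing simpleString with catalogString would probably avoid the truncation, but passing a DataType to cast sidesteps string parsing altogether.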

Perhaps you could transform the schema with a recursive function and then create a new DataFrame using the new schema.

Something like this:

  import org.apache.spark.sql.types._

  // Recursively rewrite a field's type, turning IntegerType into DoubleType.
  def castIntegerToDouble(field: StructField): StructField = field.dataType match {
    case IntegerType => StructField(field.name, DoubleType, field.nullable)
    case ArrayType(basicType, containsNull) => basicType match {
      case IntegerType => StructField(field.name, ArrayType(DoubleType, containsNull), field.nullable)
      // Pass containsNull through so the array's nullability is unchanged.
      case s: StructType => StructField(field.name, ArrayType(StructType(s.map(castIntegerToDouble)), containsNull), field.nullable)
      case _ => field
    }
    case s: StructType => StructField(field.name, StructType(s.map(castIntegerToDouble)), field.nullable)
    case _ => field
  }

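import org.apache.spark.sql.functions.col

val newSchema = StructType(df.schema.fields.map(f => castIntegerToDouble(f)))
// Cast each column to its new type; cast accepts a DataType, so no type string is parsed.
val newDf = df.select(newSchema.fields.map(f => col(f.name).cast(f.dataType)): _*)

Note that spark.createDataFrame(df.rdd, newSchema) does not convert the values by itself: the rows still hold Integer objects, which do not match the declared DoubleType, so encoding fails at runtime. Casting the columns to the types in the new schema performs the actual conversion.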
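As a variation on the recursive idea above, the following sketch recurses on DataType instead of StructField, so arrays nested inside arrays are covered as well, and it applies the result with cast. The object name, helper names, local SparkSession, and sample data are all assumptions made for illustration; map types are left unchanged.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types._

object CastIntToDoubleSketch {
  // Replace IntegerType with DoubleType at any depth of structs and arrays.
  def intToDouble(dt: DataType): DataType = dt match {
    case IntegerType => DoubleType
    case ArrayType(elem, containsNull) => ArrayType(intToDouble(elem), containsNull)
    case StructType(fields) =>
      StructType(fields.map(f => f.copy(dataType = intToDouble(f.dataType))))
    case other => other // maps and all remaining types are left as they are
  }

  // Cast every top-level column to its rewritten type, keeping the column name.
  def castIntToDouble(df: DataFrame): DataFrame =
    df.select(df.schema.fields.map { f =>
      col(f.name).cast(intToDouble(f.dataType)).as(f.name)
    }: _*)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("cast-sketch").getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: an int column and an array of (string, int) structs.
    val df = Seq((1, Seq(("a", 2), ("b", 3)))).toDF("id", "items")
    castIntToDouble(df).printSchema()

    spark.stop()
  }
}
```

Because the target type is passed as a DataType rather than a string, the truncated "... n more fields" text never reaches the SQL parser.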

Solution 1
1 2023-01-16 16:29:22
