[英]How to change all columns data types in StructType or ArrayType columns?
我有一个 DataFrame 包括一些带有StructType
和ArrayType
的列。 我想将所有IntegerType
列转换为DoubleType
。 我找到了一些解决这个问题的方法。 例如,这个答案的作用与我想要的相似。 但问题是,它不会更改嵌套在StructType
或ArrayType
列中的列的数据类型。
例如,我有一个具有以下架构的 DataFrame:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: integer (nullable = true)
|-- percentage: integer (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
执行以下脚本后:
val doubleSchema = df.schema.fields.map{f =>
f match{
case StructField(name:String, _:IntegerType, _, _) => col(name).cast(DoubleType)
case _ => col(f.name)
}
}
df.select(doubleSchema:_*).printSchema
结果是这样的:
|-- carCategories: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: integer (nullable = true)
| |-- min: integer (nullable = true)
如您所见,某些列已转换为DoubleType
,但ArrayType
和StructType
中的列未转换。
我希望最终架构是这样的:
|-- carCategories: array (nullable = true)
| |-- element: double (containsNull = true)
|-- payerId: double (nullable = true)
|-- percentage: double (nullable = true)
|-- plateNumberStatus: string (nullable = true)
|-- ratio: struct (nullable = true)
| |-- max: double (nullable = true)
| |-- min: double (nullable = true)
我怎样才能做到这一点?
先感谢您
您可以添加 case 子句来处理ArrayType
和StructType
,如下所示:
def castIntToDouble(schema: StructType): Seq[Column] = {
schema.fields.map { f =>
f.dataType match {
case IntegerType => col(f.name).cast(DoubleType)
case StructType(_) =>
col(f.name).cast(
f.dataType.simpleString.replace(s":${IntegerType.simpleString}", s":${DoubleType.simpleString}")
)
case dt: ArrayType =>
dt.elementType match {
case IntegerType => col(f.name).cast(ArrayType(DoubleType))
case StructType(_) =>
col(f.name).cast(
f.dataType.simpleString.replace(s":${IntegerType.simpleString}",s":${DoubleType.simpleString}")
)
case _ => col(f.name)
}
case _ => col(f.name)
}
}
}
当列类型为StructType
或嵌套结构数组时,function 使用DLL
字符串格式进行强制转换。 例如,如果您必须强制转换类型为struct<max:int,min:int>
的结构列ratio
,而不必重新创建您要做的整个结构:
df.withColumn("ratio", col("ratio").cast("struct<max:double,min:double>"))
现在将其应用于您的输入示例:
val df = (
Seq((Seq(1, 2, 3), 34, 87, "pending", (65, 22)))
.toDF("carCategories","payerId","percentage","plateNumberStatus","ratio")
.withColumn("ratio", col("ratio").cast("struct<max:int,min:int>"))
)
df.select(castIntToDouble(df.schema):_*).printSchema
//root
// |-- carCategories: array (nullable = true)
// | |-- element: double (containsNull = true)
// |-- payerId: double (nullable = false)
// |-- percentage: double (nullable = false)
// |-- plateNumberStatus: string (nullable = true)
// |-- ratio: struct (nullable = true)
// | |-- max: double (nullable = true)
// | |-- min: double (nullable = true)
声明:本站的技术帖子网页,遵循CC BY-SA 4.0协议,如果您需要转载,请注明本站网址或者原文地址。任何问题请咨询:yoyou2525@163.com.