How to refine Spark StructType Schema based on a list of required fields?
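I am trying to create a StructType schema from an existing schema. I have a list containing the fields required in the new schema. The tricky part is that the schema describes nested JSON data with complex fields such as ArrayType(StructType). Here is the code for the schema: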
import org.apache.spark.sql.types._
val schema1: Seq[StructField] = Seq(
  StructField("playerId", StringType, true),
  StructField("playerName", StringType, true),
  StructField("playerCountry", StringType, true),
  StructField("playerBloodType", StringType, true)
)
val schema2: Seq[StructField] =
  Seq(
    StructField("PlayerHistory", ArrayType(
      StructType(
        Seq(
          StructField("Rating", StringType, true),
          StructField("Height", StringType, true),
          StructField("Weight", StringType, true),
          StructField("CoachDetails",
            StructType(
              Seq(
                StructField("CoachName", StringType, true),
                StructField("Address",
                  StructType(
                    Seq(
                      StructField("AddressLine1", StringType, true),
                      StructField("AddressLine2", StringType, true),
                      StructField("CoachCity", StringType, true))), true),
                StructField("Suffix", StringType, true))), true),
          StructField("GoalHistory", ArrayType(
            StructType(
              Seq(
                StructField("MatchDate", StringType, true),
                StructField("NumberofGoals", StringType, true),
                StructField("SubstitutionIndicator", StringType, true))), true), true),
          StructField("receive_date", DateType, true))
      ), true
    )))
val requiredFields = List("playerId", "playerName", "Rating", "CoachName", "CoachCity", "MatchDate", "NumberofGoals")
val schema: StructType = StructType(schema1 ++ schema2)
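The variable schema is the current schema, and requiredFields contains the fields we need in the new schema. We also need the parent blocks in the new schema. The output schema should look something like this: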
val outputSchema: Seq[StructField] =
  Seq(
    StructField("playerId", StringType, true),
    StructField("playerName", StringType, true),
    StructField("PlayerHistory",
      ArrayType(StructType(Seq(
        StructField("Rating", StringType, true),
        StructField("CoachDetails",
          StructType(Seq(
            StructField("CoachName", StringType, true),
            StructField("Address",
              StructType(Seq(
                StructField("CoachCity", StringType, true))), true))), true),
        StructField("GoalHistory",
          ArrayType(StructType(Seq(
            StructField("MatchDate", StringType, true),
            StructField("NumberofGoals", StringType, true))), true), true))), true), true))
I tried to solve the problem recursively with the following code.
schema.fields.map(f => filterSchema(f, requiredFields)).filter(_.name != "")
def filterSchema(field: StructField, requiredColumns: Seq[String]): StructField = {
  field match {
    case StructField(_, inner: StructType, _, _) =>
      StructField(field.name, StructType(inner.fields.map(f => filterSchema(f, requiredColumns))))
    case StructField(_, ArrayType(structType: StructType, _), _, _) =>
      if (requiredColumns.contains(field.name))
        StructField(field.name, ArrayType(StructType(structType.fields.map(f => filterSchema(f, requiredColumns))), true), true)
      else
        StructField("", StringType, true)
    case StructField(_, _, _, _) =>
      if (requiredColumns.contains(field.name)) field else StructField("", StringType, true)
  }
}
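However, I am unable to filter out the inner struct fields.
I feel the base case of the recursive function needs some modification. Any help here would be greatly appreciated. Thanks in advance.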
Here is how I did it:
import org.apache.spark.sql.types._
class SchemaRefiner(schema: StructType, requiredColumns: Seq[String]) {
  var FINALSCHEMA: Array[StructField] = Array[StructField]()
  // Note: a field is only visited when its own name is in requiredColumns,
  // so parent names such as "PlayerHistory" must be listed as well.
  private def refine(schematoRefine: StructType, requiredColumns: Seq[String]): Unit = {
    schematoRefine.foreach(f => {
      if (requiredColumns.contains(f.name)) {
        f match {
          // Nested struct: refine its children recursively.
          case StructField(_, inner: StructType, _, _) =>
            FINALSCHEMA = FINALSCHEMA :+ StructField(f.name, StructType(new SchemaRefiner(inner, requiredColumns).getRefinedSchema), true)
          // Array of structs: refine the element struct recursively.
          case StructField(_, ArrayType(structType: StructType, _), _, _) =>
            FINALSCHEMA = FINALSCHEMA :+ StructField(f.name, ArrayType(StructType(new SchemaRefiner(structType, requiredColumns).getRefinedSchema)), true)
          // Any other (leaf) field: keep it as is.
          case _ =>
            FINALSCHEMA = FINALSCHEMA :+ f
        }
      }
    })
  }
  def getRefinedSchema: Array[StructField] = {
    refine(schema, requiredColumns)
    this.FINALSCHEMA
  }
}
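This iterates over the struct fields, and each time it encounters a new struct type it calls the function recursively to obtain the refined struct type.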
val fields = new SchemaRefiner(schema,requiredFields)
val newSchema = fields.getRefinedSchema
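A possible alternative to the mutable class above is a pure recursive function that keeps a parent field whenever at least one of its descendants is required, so that the list only needs to contain leaf names. This is a minimal sketch and not part of the original answer; SchemaPruner, prune and pruneStruct are hypothetical names, and it assumes that matching on the bare field name is acceptable (a leaf name that occurs in several branches is kept in all of them).

```scala
import org.apache.spark.sql.types._

// Hypothetical helper, not a Spark API.
object SchemaPruner {
  // Keeps a leaf field only if its name is required; keeps a struct or
  // array-of-struct field only if at least one descendant survives.
  def prune(field: StructField, required: Set[String]): Option[StructField] =
    field.dataType match {
      case st: StructType =>
        pruneStruct(st, required).map(s => field.copy(dataType = s))
      case ArrayType(st: StructType, containsNull) =>
        pruneStruct(st, required).map(s => field.copy(dataType = ArrayType(s, containsNull)))
      case _ =>
        if (required.contains(field.name)) Some(field) else None
    }

  // Returns None when no field of the struct survives, so empty parents are dropped.
  def pruneStruct(schema: StructType, required: Set[String]): Option[StructType] = {
    val kept: Seq[StructField] = schema.fields.toSeq.flatMap(f => prune(f, required))
    if (kept.isEmpty) None else Some(StructType(kept))
  }
}
```

With the schema and requiredFields from the question, it could be called as SchemaPruner.pruneStruct(schema, requiredFields.toSet), which yields an Option[StructType]; calling printTreeString() on the result is one way to inspect it. Returning Option instead of a placeholder field with an empty name avoids the extra filter step used in the question's attempt.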