Select a few columns from a nested array of structs in a DataFrame in Scala
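I have a DataFrame with an array of structs, and inside it another array of structs. Is there a simple way to select a few of the struct fields in the main array, and a few of the struct fields in the nested array, without disturbing the overall structure of the DataFrame?
Simple input: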
-MainArray
---StructCol1
---StructCol2
---StructCol3
---SubArray
------SubArrayStruct4
------SubArrayStruct5
------SubArrayStruct6
Simple output:
-MainArray
---StructCol1
---StructCol2
---SubArray
------SubArrayStruct4
------SubArrayStruct5
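The source code I tried is as follows:
// Row is needed to build the sample data below
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructField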
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.types.IntegerType
val arrayStructData = Seq(
Row("Army",List(Row("1","Infantry","100",List(Row("Gun","Station"),Row("Bazooka","Barracks"))),Row("2","Cavalry","150",List(Row("Grenadier","Seige factory"),Row("Canon","Tank Factory"))))),
Row("Navy",List(Row("3","Transport","200",List(Row("Cruiser","Cruise Lines"),Row("SubMarine","Yard"))),Row("4","Battle Ships","250",List(Row("Frigate","Dock"),Row("Galleon","Hub")))))
)
val arrayStructSchema = new StructType()
.add("Category",StringType)
.add("ArmyOrNavy",ArrayType(new StructType()
.add("ID",StringType)
.add("Type",StringType)
.add("Count",StringType)
.add("Items",ArrayType(new StructType().add("ItemName",StringType).add("ItemTrainingArea",StringType)))
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
df.printSchema()
df.show(false)
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Count: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemName: string (nullable = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Army |[[1, Infantry, 100, [[Gun, Station], [Bazooka, Barracks]]], [2, Cavalry, 150, [[Grenadier, Seige factory], [Canon, Tank Factory]]]]|
|Navy |[[3, Transport, 200, [[Cruiser, Cruise Lines], [SubMarine, Yard]]], [4, Battle Ships, 250, [[Frigate, Dock], [Galleon, Hub]]]] |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
The output I need is:
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
I tried something like this, but it does not look right:
val df2 = df.selectExpr("Category",
"Array (Struct(ArmyOrNavy.ID,CAST(ArmyOrNavy.Items AS array<array<struct<ItemName:string,ItemTrainingArea:string>>>) Items)) as ArmyOrNavy")
df2.printSchema
df2.show(false)
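You can use to_json and from_json, and supply a new DataType for the struct field (the array) when parsing the JSON back. Fields that are missing from the new schema are dropped during parsing: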
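// functions used below; the types are already imported in the question's code
import org.apache.spark.sql.functions.{col, from_json, to_json}
val newArrayType = ArrayType(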
new StructType()
.add("ID", StringType)
.add("Items", ArrayType(
new StructType()
.add("ItemTrainingArea", StringType)
))
)
val jsonFieldName = "ArmyOrNavy_json"
val transformedDF = df.withColumn(jsonFieldName, to_json(col("ArmyOrNavy")))
.withColumn("ArmyOrNavy", from_json(col(jsonFieldName), newArrayType))
.drop(jsonFieldName)
transformedDF.printSchema()
transformedDF.show(truncate = false)
// output
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+----------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+----------------------------------------------------------------------+
|Army |[[1, [[Station], [Barracks]]], [2, [[Seige factory], [Tank Factory]]]]|
|Navy |[[3, [[Cruise Lines], [Yard]]], [4, [[Dock], [Hub]]]] |
+--------+----------------------------------------------------------------------+
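The same pruning can probably be done without the JSON round-trip. The following is a minimal sketch, not part of the original answer, and it assumes Spark 2.4 or later, where the SQL higher-order function transform is available. It rebuilds each array element with named_struct and keeps only the wanted fields. The names sparkSession, sample and pruned are made up for this illustration, and the sample data is a shortened copy of the question's data.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: a local session for illustration; in spark-shell the existing `spark` can be used instead.
val sparkSession = SparkSession.builder().master("local[1]").appName("prune-nested").getOrCreate()

// Same schema as in the question.
val sampleSchema = new StructType()
  .add("Category", StringType)
  .add("ArmyOrNavy", ArrayType(new StructType()
    .add("ID", StringType)
    .add("Type", StringType)
    .add("Count", StringType)
    .add("Items", ArrayType(new StructType()
      .add("ItemName", StringType)
      .add("ItemTrainingArea", StringType)))))

// One shortened row of the question's data.
val sampleRows = Seq(
  Row("Army", List(
    Row("1", "Infantry", "100", List(Row("Gun", "Station"), Row("Bazooka", "Barracks")))))
)
val sample = sparkSession.createDataFrame(sparkSession.sparkContext.parallelize(sampleRows), sampleSchema)

// Outer transform rebuilds each ArmyOrNavy element with only ID and Items;
// inner transform rebuilds each Items element with only ItemTrainingArea.
val pruned = sample.selectExpr(
  "Category",
  """transform(ArmyOrNavy, a -> named_struct(
    |  'ID', a.ID,
    |  'Items', transform(a.Items, i -> named_struct('ItemTrainingArea', i.ItemTrainingArea))
    |)) AS ArmyOrNavy""".stripMargin)

pruned.printSchema()
pruned.show(truncate = false)
```

Compared with to_json and from_json, this keeps the data in its native column types and avoids serializing every row to a string, at the cost of spelling out the nested structure in the SQL expression.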