Select a few columns from a nested array of structs in a DataFrame in Scala
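I have a DataFrame with an array of structs, and each of those structs contains another array of structs. Is there a simple way to select only a few of the struct fields in the main array, and only a few of the fields in the nested array, without disturbing the overall structure of the DataFrame?
Simplified input: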
-MainArray
---StructCol1
---StructCol2
---StructCol3
---SubArray
------SubArrayStruct4
------SubArrayStruct5
------SubArrayStruct6
Simplified output:
-MainArray
---StructCol1
---StructCol2
---SubArray
------SubArrayStruct4
------SubArrayStruct5
The code I tried is below:
// Row is needed to build the sample data below
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
val arrayStructData = Seq(
Row("Army",List(Row("1","Infantry","100",List(Row("Gun","Station"),Row("Bazooka","Barracks"))),Row("2","Cavalry","150",List(Row("Grenadier","Seige factory"),Row("Canon","Tank Factory"))))),
Row("Navy",List(Row("3","Transport","200",List(Row("Cruiser","Cruise Lines"),Row("SubMarine","Yard"))),Row("4","Battle Ships","250",List(Row("Frigate","Dock"),Row("Galleon","Hub")))))
)
val arrayStructSchema = new StructType()
.add("Category",StringType)
.add("ArmyOrNavy",ArrayType(new StructType()
.add("ID",StringType)
.add("Type",StringType)
.add("Count",StringType)
.add("Items",ArrayType(new StructType().add("ItemName",StringType).add("ItemTrainingArea",StringType)))
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
df.printSchema()
df.show(false)
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Count: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemName: string (nullable = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Army |[[1, Infantry, 100, [[Gun, Station], [Bazooka, Barracks]]], [2, Cavalry, 150, [[Grenadier, Seige factory], [Canon, Tank Factory]]]]|
|Navy |[[3, Transport, 200, [[Cruiser, Cruise Lines], [SubMarine, Yard]]], [4, Battle Ships, 250, [[Frigate, Dock], [Galleon, Hub]]]] |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
The output I need is:
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
I tried something like this, but it doesn't look right:
val df2 = df.selectExpr("Category",
"Array (Struct(ArmyOrNavy.ID,CAST(ArmyOrNavy.Items AS array<array<struct<ItemName:string,ItemTrainingArea:string>>>) Items)) as ArmyOrNavy")
df2.printSchema
df2.show(false)
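One likely reason the attempt above looks wrong: extracting a field through an array of structs (`ArmyOrNavy.ID`) returns an array of that field's values, so `Struct(...)` builds a single struct of arrays and `Array(...)` wraps it in a one-element array, instead of keeping one struct per element. A minimal sketch to check this, assuming a local SparkSession; `sampleDF` and `extracted` are illustrative names and the two-element sample is a stand-in, not the question's `df`:

```scala
import org.apache.spark.sql.SparkSession

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("field-through-array")
  .getOrCreate()

// Hypothetical one-row sample with the same nesting as the question's df.
val sampleDF = spark.sql("""
  SELECT 'Army' AS Category,
         array(
           named_struct('ID', '1', 'Type', 'Infantry', 'Count', '100',
             'Items', array(named_struct('ItemName', 'Gun', 'ItemTrainingArea', 'Station'))),
           named_struct('ID', '2', 'Type', 'Cavalry', 'Count', '150',
             'Items', array(named_struct('ItemName', 'Canon', 'ItemTrainingArea', 'Tank Factory')))
         ) AS ArmyOrNavy
""")

// Selecting a field through an array of structs yields an array of values:
// ids is array<string>, items is array<array<struct<...>>>.
val extracted = sampleDF.selectExpr("ArmyOrNavy.ID AS ids", "ArmyOrNavy.Items AS items")
extracted.printSchema()
```

So the pruning has to happen per element of the array, which is what the answer below achieves by re-parsing each element against a smaller schema.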
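You can use to_json and from_json, and supply a new struct DataType for the struct field (the array) when parsing the JSON. Fields that are not listed in the new schema are dropped during parsing: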
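import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}
import spark.implicits._ // enables the $"..." column syntax

// Target schema: keep only ID and, inside Items, only ItemTrainingArea
val newArrayType = ArrayType(
new StructType()
.add("ID", StringType)
.add("Items", ArrayType(
new StructType()
.add("ItemTrainingArea", StringType)
))
)
// Temporary column that holds the JSON form of the array
val jsonFieldName = "ArmyOrNavy_json"
// Serialize to JSON, parse back with the reduced schema, then drop the helper column
val transformedDF = df.withColumn(jsonFieldName, to_json($"ArmyOrNavy"))
.withColumn("ArmyOrNavy", from_json(col(jsonFieldName), newArrayType))
.drop(jsonFieldName)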
transformedDF.printSchema()
transformedDF.show(truncate = false)
// output
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+----------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+----------------------------------------------------------------------+
|Army |[[1, [[Station], [Barracks]]], [2, [[Seige factory], [Tank Factory]]]]|
|Navy |[[3, [[Cruise Lines], [Yard]]], [4, [[Dock], [Hub]]]] |
+--------+----------------------------------------------------------------------+
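If the JSON round trip is a concern (every row is serialized and parsed again), a similar result can probably be reached with the `transform` higher-order function, which is available in Spark SQL expressions from Spark 2.4. This is a sketch of an alternative, not the method from the answer above. It assumes a local SparkSession; `sampleDF` and `prunedDF` are illustrative names, the one-row sample is a stand-in for `df`, and the nullability flags in the resulting schema may differ from the `from_json` version.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("prune-nested")
  .getOrCreate()

// Hypothetical one-row sample with the same nesting as the question's df.
val sampleDF = spark.sql("""
  SELECT 'Army' AS Category,
         array(
           named_struct('ID', '1', 'Type', 'Infantry', 'Count', '100',
             'Items', array(
               named_struct('ItemName', 'Gun', 'ItemTrainingArea', 'Station'),
               named_struct('ItemName', 'Bazooka', 'ItemTrainingArea', 'Barracks')))
         ) AS ArmyOrNavy
""")

// Rebuild each outer struct with only ID and Items, and each inner struct
// with only ItemTrainingArea. Array lengths and element order are unchanged.
val prunedDF = sampleDF.withColumn(
  "ArmyOrNavy",
  expr("""
    transform(ArmyOrNavy, a -> named_struct(
      'ID', a.ID,
      'Items', transform(a.Items, i -> named_struct('ItemTrainingArea', i.ItemTrainingArea))
    ))
  """)
)

prunedDF.printSchema()
prunedDF.show(truncate = false)
```

To apply this to the data from the question, replace `sampleDF` with `df`. Unlike the JSON approach, the retained fields keep their original types without being re-parsed from strings.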