Select few columns from nested array of struct from a Dataframe in Scala
I have a DataFrame with an array of structs, and inside each struct another array of structs. Is there an easy way to select a few of the struct fields in the main array, and also a few in the nested array, without disturbing the structure of the entire DataFrame?
SIMPLE INPUT:
-MainArray
---StructCol1
---StructCol2
---StructCol3
---SubArray
------SubArrayStruct4
------SubArrayStruct5
------SubArrayStruct6
SIMPLE OUTPUT:
-MainArray
---StructCol1
---StructCol2
---SubArray
------SubArrayStruct4
------SubArrayStruct5
The source code to try it is below:
import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, IntegerType, StringType, StructField, StructType}
val arrayStructData = Seq(
Row("Army",List(Row("1","Infantry","100",List(Row("Gun","Station"),Row("Bazooka","Barracks"))),Row("2","Cavalry","150",List(Row("Grenadier","Seige factory"),Row("Canon","Tank Factory"))))),
Row("Navy",List(Row("3","Transport","200",List(Row("Cruiser","Cruise Lines"),Row("SubMarine","Yard"))),Row("4","Battle Ships","250",List(Row("Frigate","Dock"),Row("Galleon","Hub")))))
)
val arrayStructSchema = new StructType()
.add("Category",StringType)
.add("ArmyOrNavy",ArrayType(new StructType()
.add("ID",StringType)
.add("Type",StringType)
.add("Count",StringType)
.add("Items",ArrayType(new StructType().add("ItemName",StringType).add("ItemTrainingArea",StringType)))
))
val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
df.printSchema()
df.show(false)
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Type: string (nullable = true)
| | |-- Count: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemName: string (nullable = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Army |[[1, Infantry, 100, [[Gun, Station], [Bazooka, Barracks]]], [2, Cavalry, 150, [[Grenadier, Seige factory], [Canon, Tank Factory]]]]|
|Navy |[[3, Transport, 200, [[Cruiser, Cruise Lines], [SubMarine, Yard]]], [4, Battle Ships, 250, [[Frigate, Dock], [Galleon, Hub]]]] |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
The output I need is:
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
I tried something like this, but it doesn't look right:
val df2 = df.selectExpr("Category",
"Array (Struct(ArmyOrNavy.ID,CAST(ArmyOrNavy.Items AS array<array<struct<ItemName:string,ItemTrainingArea:string>>>) Items)) as ArmyOrNavy")
df2.printSchema
df2.show(false)
You can do it using to_json and from_json: serialize the array column to JSON, then parse it back with a new, narrower DataType (schema) for that column. Fields that are absent from the new schema are dropped during parsing:
import org.apache.spark.sql.functions.{col, from_json, to_json}
val newArrayType = ArrayType(
new StructType()
.add("ID", StringType)
.add("Items", ArrayType(
new StructType()
.add("ItemTrainingArea", StringType)
))
)
val jsonFieldName = "ArmyOrNavy_json"
val transformedDF = df.withColumn(jsonFieldName, to_json(col("ArmyOrNavy")))
.withColumn("ArmyOrNavy", from_json(col(jsonFieldName), newArrayType))
.drop(jsonFieldName)
transformedDF.printSchema()
transformedDF.show(truncate = false)
// output
root
|-- Category: string (nullable = true)
|-- ArmyOrNavy: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- ID: string (nullable = true)
| | |-- Items: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- ItemTrainingArea: string (nullable = true)
+--------+----------------------------------------------------------------------+
|Category|ArmyOrNavy |
+--------+----------------------------------------------------------------------+
|Army |[[1, [[Station], [Barracks]]], [2, [[Seige factory], [Tank Factory]]]]|
|Navy |[[3, [[Cruise Lines], [Yard]]], [4, [[Dock], [Hub]]]] |
+--------+----------------------------------------------------------------------+
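As an alternative to the JSON round trip, the same pruning can be sketched with the Spark SQL higher-order function transform together with named_struct, rebuilding each struct with only the wanted fields. This is a minimal sketch, not the answer's method: it assumes Spark 2.4 or later (where transform is available in SQL expressions), it creates its own local SparkSession so that it can run outside spark-shell, and the names itemType, unitType, rows and pruned are illustrative. Only the first sample row from the question is used.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Assumption: a local session for the sketch; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("prune-nested").getOrCreate()

// Same schema as in the question.
val itemType = new StructType().add("ItemName", StringType).add("ItemTrainingArea", StringType)
val unitType = new StructType()
  .add("ID", StringType).add("Type", StringType).add("Count", StringType)
  .add("Items", ArrayType(itemType))
val schema = new StructType().add("Category", StringType).add("ArmyOrNavy", ArrayType(unitType))

// First row of the question's sample data.
val rows = Seq(
  Row("Army", List(
    Row("1", "Infantry", "100", List(Row("Gun", "Station"), Row("Bazooka", "Barracks"))),
    Row("2", "Cavalry", "150", List(Row("Grenadier", "Seige factory"), Row("Canon", "Tank Factory"))))))
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Rebuild every outer struct with only ID and Items,
// and every inner struct with only ItemTrainingArea.
val pruned = df.withColumn("ArmyOrNavy", expr(
  """transform(ArmyOrNavy, a -> named_struct(
    |  'ID', a.ID,
    |  'Items', transform(a.Items, i -> named_struct('ItemTrainingArea', i.ItemTrainingArea))
    |))""".stripMargin))

pruned.printSchema()
pruned.show(truncate = false)

This variant avoids serializing to and parsing from a string, but the kept fields have to be spelled out by hand in the expression. The nullable and containsNull flags in the resulting schema may differ from those produced by the from_json version.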