Select a few columns from a nested array of structs in a DataFrame in Scala

I have a DataFrame with an array of structs, and each struct contains another array of structs. Is there an easy way to select a few of the fields from the structs in the main array, and a few from the structs in the nested array, without disturbing the structure of the rest of the DataFrame?

SIMPLE INPUT:

-MainArray
---StructCol1
---StructCol2
---StructCol3
---SubArray
------SubArrayStruct4
------SubArrayStruct5
------SubArrayStruct6

SIMPLE OUTPUT:

-MainArray
---StructCol1
---StructCol2
---SubArray
------SubArrayStruct4
------SubArrayStruct5

The source code to reproduce it is below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

val arrayStructData = Seq(
      Row("Army",List(Row("1","Infantry","100",List(Row("Gun","Station"),Row("Bazooka","Barracks"))),Row("2","Cavalry","150",List(Row("Grenadier","Seige factory"),Row("Canon","Tank Factory"))))),
      Row("Navy",List(Row("3","Transport","200",List(Row("Cruiser","Cruise Lines"),Row("SubMarine","Yard"))),Row("4","Battle Ships","250",List(Row("Frigate","Dock"),Row("Galleon","Hub")))))
    )


val arrayStructSchema = new StructType()
      .add("Category",StringType)
      .add("ArmyOrNavy",ArrayType(new StructType()
        .add("ID",StringType)
        .add("Type",StringType)
        .add("Count",StringType)
        .add("Items",ArrayType(new StructType().add("ItemName",StringType).add("ItemTrainingArea",StringType)))
        ))


val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
    
df.printSchema()
df.show(false)

root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- Count: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemName: string (nullable = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)

+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Category|ArmyOrNavy                                                                                                                         |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Army    |[[1, Infantry, 100, [[Gun, Station], [Bazooka, Barracks]]], [2, Cavalry, 150, [[Grenadier, Seige factory], [Canon, Tank Factory]]]]|
|Navy    |[[3, Transport, 200, [[Cruiser, Cruise Lines], [SubMarine, Yard]]], [4, Battle Ships, 250, [[Frigate, Dock], [Galleon, Hub]]]]     |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+

The output I need is:

root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)

I tried something like this, but it doesn't look right:

val df2 = df.selectExpr("Category",
  "Array (Struct(ArmyOrNavy.ID,CAST(ArmyOrNavy.Items AS array<array<struct<ItemName:string,ItemTrainingArea:string>>>) Items))  as ArmyOrNavy")
df2.printSchema
df2.show(false)
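
One way to see why the attempt above looks wrong is to inspect the types that the dotted paths resolve to. The sketch below is self-contained for illustration only: it assumes a local SparkSession (or an existing one in spark-shell), rebuilds the same schema with a single sample row, and uses the names `itemType`, `unitType`, `sample`, `idType`, and `itemsType`, none of which appear in the original code. Selecting a field through an array of structs yields an array of that field, so `Struct(ArmyOrNavy.ID, ...)` builds one struct holding arrays rather than an array of trimmed structs.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Reuses the running session in spark-shell, otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("nested-field-types").getOrCreate()

// Same schema as in the question, rebuilt so the snippet runs on its own.
val itemType = new StructType()
  .add("ItemName", StringType)
  .add("ItemTrainingArea", StringType)
val unitType = new StructType()
  .add("ID", StringType)
  .add("Type", StringType)
  .add("Count", StringType)
  .add("Items", ArrayType(itemType))
val schema = new StructType()
  .add("Category", StringType)
  .add("ArmyOrNavy", ArrayType(unitType))

// One sample row is enough to inspect types.
val sample = Seq(
  Row("Army", List(Row("1", "Infantry", "100", List(Row("Gun", "Station"), Row("Bazooka", "Barracks")))))
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(sample), schema)

// A dotted path through an array of structs returns an array of the field values.
val idType = df.selectExpr("ArmyOrNavy.ID").schema.head.dataType
val itemsType = df.selectExpr("ArmyOrNavy.Items").schema.head.dataType

println(idType.simpleString)    // array<string>
println(itemsType.simpleString) // an array whose elements are themselves arrays of item structs
```

Since `ArmyOrNavy.ID` is already an array, wrapping it in `Struct(...)` and then `Array(...)` adds nesting instead of removing fields, which matches the impression that the result does not look right.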

You can do it with to_json and from_json: serialize the array column to a JSON string, then parse it back using a new DataType for the array that lists only the fields you want to keep. Fields that are absent from the new type are dropped while parsing the JSON:

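import org.apache.spark.sql.functions.{col, from_json, to_json}

// Target type for the array column: only the fields to keep.
val newArrayType = ArrayType(
  new StructType()
    .add("ID", StringType)
    .add("Items", ArrayType(
      new StructType()
        .add("ItemTrainingArea", StringType)
    ))
)
val jsonFieldName = "ArmyOrNavy_json"
// Serialize the array to a JSON string, parse it back with the narrower
// type (fields absent from it are dropped), then remove the helper column.
val transformedDF = df.withColumn(jsonFieldName, to_json(col("ArmyOrNavy")))
  .withColumn("ArmyOrNavy", from_json(col(jsonFieldName), newArrayType))
  .drop(jsonFieldName)
transformedDF.printSchema()
transformedDF.show(truncate = false)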


// output
root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)
+--------+----------------------------------------------------------------------+
|Category|ArmyOrNavy                                                            |
+--------+----------------------------------------------------------------------+
|Army    |[[1, [[Station], [Barracks]]], [2, [[Seige factory], [Tank Factory]]]]|
|Navy    |[[3, [[Cruise Lines], [Yard]]], [4, [[Dock], [Hub]]]]                 |
+--------+----------------------------------------------------------------------+
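
An alternative that avoids the JSON round trip, sketched here under the assumption of Spark 2.4 or later, is the SQL higher-order function `transform` combined with `named_struct`. This is a different technique from the to_json/from_json approach above, not a restatement of it. The local session, the single sample row, and the names `itemType`, `unitType`, `sample`, and `projected` are illustrative and not part of the original code.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.expr
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

// Reuses the running session in spark-shell, otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("nested-select").getOrCreate()

// Same schema as in the question, rebuilt so the snippet runs on its own.
val itemType = new StructType()
  .add("ItemName", StringType)
  .add("ItemTrainingArea", StringType)
val unitType = new StructType()
  .add("ID", StringType)
  .add("Type", StringType)
  .add("Count", StringType)
  .add("Items", ArrayType(itemType))
val schema = new StructType()
  .add("Category", StringType)
  .add("ArmyOrNavy", ArrayType(unitType))

val sample = Seq(
  Row("Army", List(Row("1", "Infantry", "100", List(Row("Gun", "Station"), Row("Bazooka", "Barracks")))))
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(sample), schema)

// Rebuild each outer struct with only ID and Items, and each inner struct
// with only ItemTrainingArea. The outer lambda variable `a` is one element
// of ArmyOrNavy; the inner variable `i` is one element of a.Items.
val projected = df.withColumn(
  "ArmyOrNavy",
  expr(
    """transform(ArmyOrNavy, a ->
      |  named_struct(
      |    'ID', a.ID,
      |    'Items', transform(a.Items, i -> named_struct('ItemTrainingArea', i.ItemTrainingArea))
      |  )
      |)""".stripMargin
  )
)

projected.printSchema()
projected.show(truncate = false)
```

The field names and nesting match the desired output, and the values keep their original types because nothing is serialized to a string. The nullability flags reported by printSchema may differ from the from_json result, since structs built with `named_struct` are not themselves nullable.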
