
Select few columns from nested array of struct from a Dataframe in Scala

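I have a dataframe with an array of structs, and each struct contains another array of structs. Is there an easy way to select a few fields of the structs in the main array, and a few fields of the structs in the nested array, without disturbing the structure of the rest of the dataframe?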

Simple input:

-MainArray
---StructCol1
---StructCol2
---StructCol3
---SubArray
------SubArrayStruct4
------SubArrayStruct5
------SubArrayStruct6

Simple output:

-MainArray
---StructCol1
---StructCol2
---SubArray
------SubArrayStruct4
------SubArrayStruct5

The source code I tried is below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.types.IntegerType

val arrayStructData = Seq(
      Row("Army",List(Row("1","Infantry","100",List(Row("Gun","Station"),Row("Bazooka","Barracks"))),Row("2","Cavalry","150",List(Row("Grenadier","Seige factory"),Row("Canon","Tank Factory"))))),
      Row("Navy",List(Row("3","Transport","200",List(Row("Cruiser","Cruise Lines"),Row("SubMarine","Yard"))),Row("4","Battle Ships","250",List(Row("Frigate","Dock"),Row("Galleon","Hub")))))
    )


val arrayStructSchema = new StructType()
      .add("Category",StringType)
      .add("ArmyOrNavy",ArrayType(new StructType()
        .add("ID",StringType)
        .add("Type",StringType)
        .add("Count",StringType)
        .add("Items",ArrayType(new StructType().add("ItemName",StringType).add("ItemTrainingArea",StringType)))
        ))


val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
    
df.printSchema()
df.show(false)

root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- Count: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemName: string (nullable = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)

+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Category|ArmyOrNavy                                                                                                                         |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Army    |[[1, Infantry, 100, [[Gun, Station], [Bazooka, Barracks]]], [2, Cavalry, 150, [[Grenadier, Seige factory], [Canon, Tank Factory]]]]|
|Navy    |[[3, Transport, 200, [[Cruiser, Cruise Lines], [SubMarine, Yard]]], [4, Battle Ships, 250, [[Frigate, Dock], [Galleon, Hub]]]]     |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+

The output I need is:

root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)

I tried something like this, but it does not look right:

val df2 = df.selectExpr("Category",
  "Array (Struct(ArmyOrNavy.ID,CAST(ArmyOrNavy.Items AS array<array<struct<ItemName:string,ItemTrainingArea:string>>>) Items))  as ArmyOrNavy")
df2.printSchema
df2.show(false)
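
One likely reason the attempt above changes the shape: selecting a field through an array of structs (`ArmyOrNavy.ID`) returns an array of that field's values for the whole array, not one value per element. Wrapping those in `Struct(...)` and `Array(...)` therefore produces a single struct holding arrays instead of an array of pruned structs. The sketch below illustrates this on a one-row stand-in for the question's data; the stand-in row, the `probe` name, and the local `SparkSession` settings are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the sketch; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("field-through-array").getOrCreate()

// One-row stand-in for the question's DataFrame (same field names, less data).
val df = spark.sql("""
  SELECT 'Army' AS Category,
         array(
           named_struct('ID', '1', 'Type', 'Infantry', 'Count', '100',
             'Items', array(named_struct('ItemName', 'Gun', 'ItemTrainingArea', 'Station')))
         ) AS ArmyOrNavy
""")

// Field access through an array collects the field across all elements.
val probe = df.selectExpr("ArmyOrNavy.ID AS ids", "ArmyOrNavy.Items AS items")

// ids   has type array<string>
// items has type array<array<struct<ItemName:string,ItemTrainingArea:string>>>
probe.printSchema()
```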

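You can use to_json and from_json, and pass a new struct DataType for the array-of-struct field when parsing the JSON. Fields that are not in the new schema are dropped during parsing.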

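import org.apache.spark.sql.functions.{col, from_json, to_json}
import spark.implicits._ // enables the $"..." column syntax

val newArrayType = ArrayType(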
  new StructType()
    .add("ID", StringType)
    .add("Items", ArrayType(
      new StructType()
        .add("ItemTrainingArea", StringType)
    ))
)
val jsonFieldName = "ArmyOrNavy_json"
val transformedDF = df.withColumn(jsonFieldName, to_json($"ArmyOrNavy"))
  .withColumn("ArmyOrNavy", from_json(col(jsonFieldName), newArrayType))
  .drop(jsonFieldName)
transformedDF.printSchema()
transformedDF.show(truncate = false)


// output
root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)
+--------+----------------------------------------------------------------------+
|Category|ArmyOrNavy                                                            |
+--------+----------------------------------------------------------------------+
|Army    |[[1, [[Station], [Barracks]]], [2, [[Seige factory], [Tank Factory]]]]|
|Navy    |[[3, [[Cruise Lines], [Yard]]], [4, [[Dock], [Hub]]]]                 |
+--------+----------------------------------------------------------------------+
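
As an alternative sketch, not part of the answer above, the same pruning can probably be done without the JSON round trip by using the SQL higher-order function `transform` together with `named_struct`. This assumes Spark 2.4 or later, where `transform` is available in SQL expressions. The one-row `df` and the `pruned` name are stand-ins chosen for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Local session for the sketch; in spark-shell, getOrCreate() reuses the existing one.
val spark = SparkSession.builder().master("local[*]").appName("prune-nested").getOrCreate()

// One-row stand-in for the question's DataFrame (same field names, less data).
val df = spark.sql("""
  SELECT 'Army' AS Category,
         array(
           named_struct('ID', '1', 'Type', 'Infantry', 'Count', '100',
             'Items', array(
               named_struct('ItemName', 'Gun', 'ItemTrainingArea', 'Station'),
               named_struct('ItemName', 'Bazooka', 'ItemTrainingArea', 'Barracks')))
         ) AS ArmyOrNavy
""")

// Rebuild each outer struct with only ID, and each inner struct with only ItemTrainingArea.
val pruned = df.withColumn("ArmyOrNavy", expr("""
  transform(ArmyOrNavy, a -> named_struct(
    'ID', a.ID,
    'Items', transform(a.Items, i -> named_struct('ItemTrainingArea', i.ItemTrainingArea))
  ))
"""))

pruned.printSchema()
pruned.show(truncate = false)
```

Compared with to_json/from_json, this keeps each field's original data type instead of serializing to a string and parsing it back, at the cost of listing the kept fields explicitly in the expression.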
