
Select a few columns from a nested array of structs in a DataFrame in Scala

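I have a DataFrame with an array of structs, and each struct contains another array of structs. Is there a simple way to select a few struct fields from the main array and a few from the nested array without disturbing the overall structure of the DataFrame?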

Simplified input:

-MainArray
---StructCol1
---StructCol2
---StructCol3
---SubArray
------SubArrayStruct4
------SubArrayStruct5
------SubArrayStruct6

Simplified output:

-MainArray
---StructCol1
---StructCol2
---SubArray
------SubArrayStruct4
------SubArrayStruct5

The source code I tried is below:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.StructField
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.types.StringType
import org.apache.spark.sql.types.ArrayType
import org.apache.spark.sql.types.IntegerType

val arrayStructData = Seq(
      Row("Army",List(Row("1","Infantry","100",List(Row("Gun","Station"),Row("Bazooka","Barracks"))),Row("2","Cavalry","150",List(Row("Grenadier","Seige factory"),Row("Canon","Tank Factory"))))),
      Row("Navy",List(Row("3","Transport","200",List(Row("Cruiser","Cruise Lines"),Row("SubMarine","Yard"))),Row("4","Battle Ships","250",List(Row("Frigate","Dock"),Row("Galleon","Hub")))))
    )


val arrayStructSchema = new StructType()
      .add("Category",StringType)
      .add("ArmyOrNavy",ArrayType(new StructType()
        .add("ID",StringType)
        .add("Type",StringType)
        .add("Count",StringType)
        .add("Items",ArrayType(new StructType().add("ItemName",StringType).add("ItemTrainingArea",StringType)))
        ))


val df = spark.createDataFrame(spark.sparkContext.parallelize(arrayStructData),arrayStructSchema)
    
df.printSchema()
df.show(false)

root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Type: string (nullable = true)
 |    |    |-- Count: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemName: string (nullable = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)

+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Category|ArmyOrNavy                                                                                                                         |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+
|Army    |[[1, Infantry, 100, [[Gun, Station], [Bazooka, Barracks]]], [2, Cavalry, 150, [[Grenadier, Seige factory], [Canon, Tank Factory]]]]|
|Navy    |[[3, Transport, 200, [[Cruiser, Cruise Lines], [SubMarine, Yard]]], [4, Battle Ships, 250, [[Frigate, Dock], [Galleon, Hub]]]]     |
+--------+-----------------------------------------------------------------------------------------------------------------------------------+

The output I need is:

root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)

I tried something like this, but it does not look right:

val df2 = df.selectExpr("Category",
  "Array (Struct(ArmyOrNavy.ID,CAST(ArmyOrNavy.Items AS array<array<struct<ItemName:string,ItemTrainingArea:string>>>) Items))  as ArmyOrNavy")
df2.printSchema
df2.show(false)

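You can use to_json and from_json, passing a new, narrower DataType for the struct field (the array) when parsing the JSON. Fields that are not listed in the new type are dropped during parsing.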

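import org.apache.spark.sql.functions.{col, from_json, to_json}

// Target type: keep only ID and, inside Items, only ItemTrainingArea
val newArrayType = ArrayType(
  new StructType()
    .add("ID", StringType)
    .add("Items", ArrayType(
      new StructType()
        .add("ItemTrainingArea", StringType)
    ))
)
val jsonFieldName = "ArmyOrNavy_json"
// Serialize the array to a JSON string, then parse it back with the narrower type
val transformedDF = df.withColumn(jsonFieldName, to_json(col("ArmyOrNavy")))
  .withColumn("ArmyOrNavy", from_json(col(jsonFieldName), newArrayType))
  .drop(jsonFieldName)
transformedDF.printSchema()
transformedDF.show(truncate = false)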


// output
root
 |-- Category: string (nullable = true)
 |-- ArmyOrNavy: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- ID: string (nullable = true)
 |    |    |-- Items: array (nullable = true)
 |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |-- ItemTrainingArea: string (nullable = true)
+--------+----------------------------------------------------------------------+
|Category|ArmyOrNavy                                                            |
+--------+----------------------------------------------------------------------+
|Army    |[[1, [[Station], [Barracks]]], [2, [[Seige factory], [Tank Factory]]]]|
|Navy    |[[3, [[Cruise Lines], [Yard]]], [4, [[Dock], [Hub]]]]                 |
+--------+----------------------------------------------------------------------+
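
If the JSON round trip feels heavy, a possible alternative is the SQL higher-order function transform (available in Spark SQL from 2.4), which rebuilds each struct in place. This is only a sketch, not part of the original answer: it assumes the df from the question is in scope, and prunedDF is a name made up for illustration. The nullability flags printed by printSchema may differ from the from_json version.

```scala
import org.apache.spark.sql.functions.expr

// Assumption: `df` is the DataFrame built in the question.
// For every element of ArmyOrNavy, build a new struct holding only ID and a
// rewritten Items array; for every element of Items, keep only ItemTrainingArea.
val prunedDF = df.withColumn(
  "ArmyOrNavy",
  expr("""
    transform(ArmyOrNavy, unit -> named_struct(
      'ID', unit.ID,
      'Items', transform(unit.Items, item -> named_struct(
        'ItemTrainingArea', item.ItemTrainingArea
      ))
    ))
  """)
)

prunedDF.printSchema()
prunedDF.show(truncate = false)
```

The fields to keep are named explicitly here, so the values stay as columns instead of being serialized to strings and parsed again.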
