
Spark - Map a flat DataFrame to a configurable nested JSON schema

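I have a flat dataframe with 5-6 columns. I want to nest them and convert it into a nested dataframe so that I can then write it to Parquet format.

However, I don't want to use case classes, because I am trying to keep the code as configurable as possible. I'm stuck on this part and need some help.

My input: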

ID  ID-2  Count(apple)  Count(banana)  Count(potato)  Count(Onion)
 1  23    1             0              2              0
 2  23    0             1              0              1
 2  29    1             0              1              0

My desired output:

Row 1:

{
  "id": 1,
  "ID-2": 23,
  "fruits": {
    "count of apple": 1,
    "count of banana": 0
  },
  "vegetables": {
    "count of potato": 2,
    "count of onion": 0
  }
} 

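I tried using the "map" function on the Spark dataframe to map the values to a case class. However, I will be experimenting with the field names and may change them as well.

I don't want to maintain a case class and map the rows to the SQL column names, because that would involve a code change every time.

I was thinking of maintaining a HashMap that maps the dataframe's column names to the column names I want. For example, in the example above I would map "Count(apple)" to "count of apple". However, I cannot think of a good, simple way to pass the schema in as configuration and then apply it in my code.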

Here is one approach that uses a Scala Map to define the column mapping, with the following dataset:

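// toDF needs the session implicits; assumes a SparkSession named spark, as in spark-shell
import spark.implicits._

val df = Seq(
  (1, 23, 1, 0, 2, 0),
  (2, 23, 0, 1, 0, 1),
  (2, 29, 1, 0, 1, 0)
).toDF("ID", "ID-2", "count(apple)", "count(banana)", "count(potato)", "count(onion)")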

First, we declare the mapping as a scala.collection.immutable.Map, together with the function responsible for applying it:

import org.apache.spark.sql.{Column, DataFrame}

val colMapping = Map(
  "count(banana)" -> "no of banana",
  "count(apple)" -> "no of apples",
  "count(potato)" -> "no of potatos",
  "count(onion)" -> "no of onions")

// Alias every column that has an entry in colsMapping; leave the others untouched.
def mapColumns(colsMapping: Map[String, String], df: DataFrame): DataFrame = {
  val mapping = df
    .columns
    .map { c => if (colsMapping.contains(c)) df(c).alias(colsMapping(c)) else df(c) }
    .toList

  df.select(mapping: _*)
}

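The function loops through the columns of the given dataframe and identifies those whose names appear as keys in the mapping. It then returns a dataframe in which those columns are renamed (aliased) according to the mapping, while all other columns are left unchanged.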
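The renaming rule can be checked without Spark at all, since it only depends on the column names. The following is a small illustrative sketch in plain Scala; the object name RenameSketch is hypothetical, and the array stands in for what df.columns would return for the example dataset.

```scala
object RenameSketch {
  // Column names of the example dataset, in the order df.columns would return them.
  val columns: Array[String] =
    Array("ID", "ID-2", "count(apple)", "count(banana)", "count(potato)", "count(onion)")

  val colMapping: Map[String, String] = Map(
    "count(banana)" -> "no of banana",
    "count(apple)" -> "no of apples",
    "count(potato)" -> "no of potatos",
    "count(onion)" -> "no of onions")

  // Names with a mapping entry are replaced; all others pass through unchanged.
  // The original column order is preserved, whatever the order of the Map entries.
  val renamed: Array[String] = columns.map(c => colMapping.getOrElse(c, c))

  def main(args: Array[String]): Unit =
    println(renamed.mkString(", "))
    // prints: ID, ID-2, no of apples, no of banana, no of potatos, no of onions
}
```

This is why the renamed columns in the result keep the dataframe's order rather than the order in which the mapping was declared.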

val df1 = mapColumns(colMapping, df)
df1.show(false)

+---+----+------------+------------+-------------+------------+
|ID |ID-2|no of apples|no of banana|no of potatos|no of onions|
+---+----+------------+------------+-------------+------------+
|1  |23  |1           |0           |2            |0           |
|2  |23  |0           |1           |0            |1           |
|2  |29  |1           |0           |1            |0           |
+---+----+------------+------------+-------------+------------+

Finally, we build the fruits and vegetables columns using the struct type:

import org.apache.spark.sql.functions.{col, struct}

df1.withColumn("fruits", struct(col(colMapping("count(banana)")), col(colMapping("count(apple)"))))
  .withColumn("vegetables", struct(col(colMapping("count(potato)")), col(colMapping("count(onion)"))))
  .drop(colMapping.values.toList: _*)
  .toJSON
  .show(false)

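Note that once the structs have been built, we drop the top-level columns named in colMapping, since their values now live inside fruits and vegetables.

Output: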

+-----------------------------------------------------------------------------------------------------------------+
|value                                                                                                            |
+-----------------------------------------------------------------------------------------------------------------+
|{"ID":1,"ID-2":23,"fruits":{"no of banana":0,"no of apples":1},"vegetables":{"no of potatos":2,"no of onions":0}}|
|{"ID":2,"ID-2":23,"fruits":{"no of banana":1,"no of apples":0},"vegetables":{"no of potatos":0,"no of onions":1}}|
|{"ID":2,"ID-2":29,"fruits":{"no of banana":0,"no of apples":1},"vegetables":{"no of potatos":1,"no of onions":0}}|
+-----------------------------------------------------------------------------------------------------------------+
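The same idea can be taken one step further so that the nesting itself, and not only the renaming, comes from configuration. The following is a minimal sketch under stated assumptions, not part of the answer above: the names NestByConfig, nesting and nest are hypothetical, and the configuration is written as an in-code Seq that could equally be loaded from a file. It also assumes the source column names in the configuration match the dataframe's names exactly, including case.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.struct

object NestByConfig {
  // Hypothetical config: struct name -> ordered (source column -> nested field name).
  val nesting: Seq[(String, Seq[(String, String)])] = Seq(
    "fruits" -> Seq(
      "count(apple)" -> "count of apple",
      "count(banana)" -> "count of banana"),
    "vegetables" -> Seq(
      "count(potato)" -> "count of potato",
      "count(onion)" -> "count of onion"))

  def nest(df: DataFrame, nesting: Seq[(String, Seq[(String, String)])]): DataFrame = {
    // Source columns that end up inside a struct (compared case-sensitively here).
    val nested: Set[String] = nesting.flatMap(_._2.map(_._1)).toSet
    // Columns not mentioned in the config are kept at the top level, in their original order.
    val passThrough: Seq[Column] = df.columns.filterNot(nested.contains).map(c => df(c)).toSeq
    // One struct column per configured group, with each field aliased to its target name.
    val structs: Seq[Column] = nesting.map { case (name, fields) =>
      struct(fields.map { case (src, dst) => df(src).alias(dst) }: _*).alias(name)
    }
    df.select((passThrough ++ structs): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nest-by-config").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, 23, 1, 0, 2, 0),
      (2, 23, 0, 1, 0, 1),
      (2, 29, 1, 0, 1, 0)
    ).toDF("ID", "ID-2", "count(apple)", "count(banana)", "count(potato)", "count(onion)")

    nest(df, nesting).toJSON.show(false)
    spark.stop()
  }
}
```

With this shape, changing field names or moving a column to a different group only requires editing the configuration. Since the stated goal is Parquet, it may be worth checking the target names against your Spark version: depending on the version, the Parquet writer can reject field names containing spaces or parentheses, in which case names such as count_of_apple would be a safer choice.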

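In Scala, :: (double colon) is the "cons" operator on lists. It is how you build a Scala list or prepend an element to an existing list. Because List is immutable, prepending returns a new list and leaves the original unchanged.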

scala> val aList = 24 :: 34 :: 56 :: Nil
aList: List[Int] = List(24, 34, 56)

scala> 99 :: aList
res3: List[Int] = List(99, 24, 34, 56)

In the first example, Nil is the empty list and serves as the tail for the rightmost cons operation.

However:

scala> val anotherList = 23 :: 34
<console>:12: error: value :: is not a member of Int
       val anotherList = 23 :: 34

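An error is raised because there is no existing list to prepend to: :: is a method of List that is invoked on its right-hand operand, and here that operand is the Int 34.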
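The error follows from how Scala parses operators whose names end in a colon: they are right-associative and are called on the right-hand operand. A short sketch in plain Scala makes this concrete; the object and value names are only illustrative.

```scala
object ConsSketch {
  // Operator form: evaluated from the right, starting with the empty list Nil.
  val viaOperator: List[Int] = 23 :: 34 :: Nil

  // The same expression written as explicit method calls on the right-hand operand.
  val viaMethod: List[Int] = Nil.::(34).::(23)

  // Equivalent construction through the List factory.
  val viaFactory: List[Int] = List(23, 34)

  // Prepending builds a new list; viaOperator itself is not modified.
  val longer: List[Int] = 99 :: viaOperator

  def main(args: Array[String]): Unit = {
    println(viaOperator) // prints: List(23, 34)
    println(viaMethod)   // prints: List(23, 34)
    println(longer)      // prints: List(99, 23, 34)
  }
}
```

Ending the chain with Nil, or using List(23, 34), gives the failing expression a list to prepend to.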

import org.apache.spark.sql.functions.{collect_list, struct}

val df = spark.read.option("header", "true").csv("/sampleinput.txt")

// Build one struct per group, then collect the structs for each (ID, ID-2) pair.
val df1 = df
  .withColumn("fruits", struct("Count(apple)", "Count(banana)"))
  .withColumn("vegetables", struct("Count(potato)", "Count(Onion)"))
  .groupBy("ID", "ID-2")
  .agg(collect_list("fruits") as "fruits", collect_list("vegetables") as "vegetables")
  .toJSON

df1.take(1)

Output:

{"ID":"2","ID-2":"23","fruits":[{"Count(apple)":"0","Count(banana)":"1"}],"vegetables":[{"Count(potato)":"0","Count(Onion)":"1"}]}
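In this output every value is a string, because the CSV is read without schema inference, and fruits and vegetables are arrays, because collect_list gathers one struct per row of each group. If numbers and plain nested objects are wanted instead, a variant along the following lines may help. It is only a sketch: the names TypedNestSketch and nestTyped and the temporary sample file are hypothetical, and it assumes a comma-separated file with the header used in the question.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.struct

object TypedNestSketch {
  def nestTyped(spark: SparkSession, path: String): DataFrame =
    spark.read
      .option("header", "true")
      .option("inferSchema", "true") // infer numeric types instead of reading everything as strings
      .csv(path)
      .withColumn("fruits", struct("Count(apple)", "Count(banana)"))
      .withColumn("vegetables", struct("Count(potato)", "Count(Onion)"))
      // No groupBy here: each input row becomes one nested row, so drop the flat columns by hand.
      .drop("Count(apple)", "Count(banana)", "Count(potato)", "Count(Onion)")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("typed-nest").getOrCreate()

    // Hypothetical sample file standing in for /sampleinput.txt.
    val file = Files.createTempFile("sampleinput", ".csv")
    val content =
      "ID,ID-2,Count(apple),Count(banana),Count(potato),Count(Onion)\n" +
      "1,23,1,0,2,0\n" +
      "2,23,0,1,0,1\n"
    Files.write(file, content.getBytes("UTF-8"))

    nestTyped(spark, file.toString).toJSON.show(false)
    spark.stop()
  }
}
```

Schema inference costs an extra pass over the file, so for large inputs an explicit schema may be preferable.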
