
Convert a DataFrame to a transformed DataFrame in Spark Scala

So, I have a DataFrame in Spark that looks like this:

[name,target] this is the header
[ABCD,1]
[XYZA,1]
[GFFD,1]
[NAAS,1]
[ABCD,2]
[XYZA,2]
[NAAS,2]
[VDDE,2]

I want to transform it into a DataFrame like this:

[name, count(target=1), count(target=2)]
[ABCD, 1,1]
[XYZA, 1,1]
[GFFD, 1,0]
AND SO ON.....

Is there a way to do this?

Here are two possible solutions.

Sample input data:

import spark.implicits._
val df = Seq(
  ("ABCD",1),
  ("XYZA",1),
  ("GFFD",1),
  ("NAAS",1),
  ("ABCD",2),
  ("XYZA",2),
  ("NAAS",2),
  ("VDDE",2),
  ("EXAMPLE", 20)
).toDF("name", "target")

df.show()

+-------+------+
|   name|target|
+-------+------+
|   ABCD|     1|
|   XYZA|     1|
|   GFFD|     1|
|   NAAS|     1|
|   ABCD|     2|
|   XYZA|     2|
|   NAAS|     2|
|   VDDE|     2|
|EXAMPLE|    20|
+-------+------+

1 - Using a map, which returns only the non-zero occurrences.

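// Typed view of one input row
case class DataItem(name: String, target: Int)

df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups { case (nameKey, targetIter) =>
    val targetList = targetIter.map(_.target).toSeq
    // target value -> number of occurrences, built as a strict Map
    // (mapValues would return a lazy view instead)
    val occMap = targetList.groupBy(identity).map { case (t, hits) => t -> hits.size }
    (nameKey, occMap)
  }
  .toDF("name", "target_count")
  .show()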


+-------+----------------+
|   name|    target_count|
+-------+----------------+
|   VDDE|        [2 -> 1]|
|   NAAS|[2 -> 1, 1 -> 1]|
|EXAMPLE|       [20 -> 1]|
|   GFFD|        [1 -> 1]|
|   XYZA|[2 -> 1, 1 -> 1]|
|   ABCD|[2 -> 1, 1 -> 1]|
+-------+----------------+
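
The counting step inside `mapGroups` is ordinary Scala collection code, so it can be tried without a Spark session. A minimal sketch of just that step; the object and method names (`TargetCounts.countTargets`) are made up for illustration and are not part of the answer above:

```scala
object TargetCounts {
  // Hypothetical helper: maps each distinct target value to how often it occurs.
  // Values that never occur are simply absent from the result.
  def countTargets(targets: Seq[Int]): Map[Int, Int] =
    targets.groupBy(identity).map { case (t, hits) => t -> hits.size }
}
```

For example, `TargetCounts.countTargets(Seq(1, 2, 2))` should give a map with one occurrence of 1 and two occurrences of 2, and no entry for any other value.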

2 - Using a list that shows the occurrence counts (including 0), where position i (counting from 1) holds the count for target value i.

case class DataItem(name: String, target: Int)

df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups { case (nameKey, targetIter) =>
    val targetList = targetIter.map(_.target).toSeq
    val occMap = targetList.groupBy(identity).map { case (t, hits) => t -> hits.size }
    // Largest target value seen for this name (not the most frequent one)
    val maxTarget = occMap.keys.max
    // Position i (1-based) holds the count for target i; missing targets become 0
    val occList = (1 to maxTarget).map(i => occMap.getOrElse(i, 0))
    (nameKey, occList)
  }
  .toDF("name", "target_count")
  .show(20, false)


+-------+------------------------------------------------------------+
|name   |target_count                                                |
+-------+------------------------------------------------------------+
|VDDE   |[0, 1]                                                      |
|NAAS   |[1, 1]                                                      |
|EXAMPLE|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]|
|GFFD   |[1]                                                         |
|XYZA   |[1, 1]                                                      |
|ABCD   |[1, 1]                                                      |
+-------+------------------------------------------------------------+
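
The step that turns the sparse counts into a zero-filled list can also be sketched in plain Scala. This assumes all target values are positive integers, and the names `DenseCounts.denseCounts` are hypothetical. The list has to run up to the largest target value, which is the largest key of the map and not the key with the highest count:

```scala
object DenseCounts {
  // Hypothetical helper: expands target -> count into a list where
  // position i (1-based) holds the count for target i, with 0 for gaps.
  // Assumes every key is a positive integer.
  def denseCounts(occ: Map[Int, Int]): Seq[Int] =
    if (occ.isEmpty) Seq.empty
    else (1 to occ.keys.max).map(i => occ.getOrElse(i, 0))
}
```

The difference matters for input such as `Map(1 -> 3, 2 -> 1)`: selecting the key with the highest count would stop at target 1 and drop the entry for target 2.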

The DataFrame can be transformed with `pivot`:

df
  .groupBy("name")
  .pivot("target")
  .count()
  // replace nulls with 0
  .na.fill(0)

Using the sample data provided by Cesar A. Mostacero, the result is:

+-------+---+---+---+
|name   |1  |2  |20 |
+-------+---+---+---+
|EXAMPLE|0  |0  |1  |
|XYZA   |1  |1  |0  |
|GFFD   |1  |0  |0  |
|VDDE   |0  |1  |0  |
|ABCD   |1  |1  |0  |
|NAAS   |1  |1  |0  |
+-------+---+---+---+
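
If only the columns asked for in the question are wanted (counts for target 1 and target 2), the pivot values can be listed explicitly, which also saves Spark from scanning the data to discover them. A minimal sketch, assuming a DataFrame with the `name` and `target` columns shown above; the helper `PivotTargets.countByTarget` and the `count_target_N` column names are made up for illustration:

```scala
import org.apache.spark.sql.DataFrame

object PivotTargets {
  // Hypothetical helper: one row per name, one count column per listed target.
  def countByTarget(df: DataFrame, targets: Seq[Int]): DataFrame = {
    val pivoted = df
      .groupBy("name")
      .pivot("target", targets) // only these target values become columns
      .count()
      .na.fill(0)               // a name without a given target gets 0, not null
    // Pivot columns are named after the values ("1", "2"); give them clearer names.
    targets.foldLeft(pivoted) { (acc, t) =>
      acc.withColumnRenamed(t.toString, s"count_target_$t")
    }
  }
}
```

Called as `PivotTargets.countByTarget(df, Seq(1, 2)).show(false)`, this should produce the columns `name`, `count_target_1` and `count_target_2`. Rows whose target is not in the list (such as 20 in the sample data) do not get a column of their own.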
