Converting a DataFrame into a transformed DataFrame in Spark Scala

Question

So, I have a DataFrame in Spark that looks like this:

[name,target] this is the header
[ABCD,1]
[XYZA,1]
[GFFD,1]
[NAAS,1]
[ABCD,2]
[XYZA,2]
[NAAS,2]
[VDDE,2]

I want to convert it into a DataFrame like this:

[name, count(target=1), count(target=2)]
[ABCD, 1,1]
[XYZA, 1,1]
[GFFD, 1,0]
and so on...

Is there a way to do this?

Answer 1

Here are two possible solutions.

Sample input data:

import spark.implicits._
val df = Seq(
  ("ABCD",1),
  ("XYZA",1),
  ("GFFD",1),
  ("NAAS",1),
  ("ABCD",2),
  ("XYZA",2),
  ("NAAS",2),
  ("VDDE",2),
  ("EXAMPLE", 20)
).toDF("name", "target")

df.show()

+-------+------+
|   name|target|
+-------+------+
|   ABCD|     1|
|   XYZA|     1|
|   GFFD|     1|
|   NAAS|     1|
|   ABCD|     2|
|   XYZA|     2|
|   NAAS|     2|
|   VDDE|     2|
|EXAMPLE|    20|
+-------+------+

1 - Using a map, which returns only the non-zero occurrences.

case class DataItem(name: String, target: Int)

df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups{
    case (nameKey, targetIter) =>{
     val targetList = targetIter.map(_.target).toSeq
     val occMap = targetList.groupBy(a=>a).mapValues(_.size)
      (nameKey, occMap)
    }
  }
  .toDF("name", "target_count").show()


+-------+----------------+
|   name|    target_count|
+-------+----------------+
|   VDDE|        [2 -> 1]|
|   NAAS|[2 -> 1, 1 -> 1]|
|EXAMPLE|       [20 -> 1]|
|   GFFD|        [1 -> 1]|
|   XYZA|[2 -> 1, 1 -> 1]|
|   ABCD|[2 -> 1, 1 -> 1]|
+-------+----------------+

2 - Using a list that shows the occurrence counts (including 0), where the position in the list corresponds to the target value.

case class DataItem(name: String, target: Int)

df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups{
    case (nameKey, targetIter) =>{
       val targetList = targetIter.map(_.target).toSeq
       val occMap = targetList.groupBy(a=>a).mapValues(_.size)
       // size the list by the largest target value, not by the most frequent one
       val maxTarget = occMap.keys.max
       val occList = for (i <- 1 to maxTarget) yield occMap.getOrElse(i, 0)

      (nameKey, occList)
    }
  }
  .toDF("name", "target_count").show(20, false)


+-------+------------------------------------------------------------+
|name   |target_count                                                |
+-------+------------------------------------------------------------+
|VDDE   |[0, 1]                                                      |
|NAAS   |[1, 1]                                                      |
|EXAMPLE|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]|
|GFFD   |[1]                                                         |
|XYZA   |[1, 1]                                                      |
|ABCD   |[1, 1]                                                      |
+-------+------------------------------------------------------------+
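The per-group logic of option 2 can also be tried outside Spark. The following is a minimal plain-Scala sketch, not part of the answer itself; `OccurrenceSketch` and `occurrenceList` are hypothetical names chosen for illustration. It assumes target values are positive integers and sizes the list by the largest target value seen.

```scala
// Plain-Scala sketch of the per-group counting from option 2 (no Spark needed).
// `OccurrenceSketch` and `occurrenceList` are hypothetical names, not library APIs.
object OccurrenceSketch {

  /** For targets in 1..max, returns how often each value occurs (0 if absent). */
  def occurrenceList(targets: Seq[Int]): Seq[Int] = {
    if (targets.isEmpty) Seq.empty
    else {
      // target value -> number of occurrences
      val occMap: Map[Int, Int] =
        targets.groupBy(identity).map { case (t, ts) => t -> ts.size }
      // position 1 holds the count for target 1, position 2 for target 2, ...
      (1 to occMap.keys.max).map(t => occMap.getOrElse(t, 0))
    }
  }

  def main(args: Array[String]): Unit = {
    println(occurrenceList(Seq(1, 2)))    // counts 1, 1 (like ABCD)
    println(occurrenceList(Seq(2)))       // counts 0, 1 (like VDDE)
    println(occurrenceList(Seq(1, 1, 3))) // counts 2, 0, 1
  }
}
```

The last call shows why the list length should follow the largest target value: sizing it by the most frequent value would silently drop target 3.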

Answer 2

The DataFrame can be transformed with "pivot":

df
  .groupBy("name")
  .pivot("target")
  .count()
  // replace nulls with 0
  .na.fill(0)

Using the data provided by Cesar A. Mostacero, the result is:

+-------+---+---+---+
|name   |1  |2  |20 |
+-------+---+---+---+
|EXAMPLE|0  |0  |1  |
|XYZA   |1  |1  |0  |
|GFFD   |1  |0  |0  |
|VDDE   |0  |1  |0  |
|ABCD   |1  |1  |0  |
|NAAS   |1  |1  |0  |
+-------+---+---+---+
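If only the two columns from the question are wanted, the pivot values can be listed explicitly and the resulting columns renamed. This is a sketch under assumptions: the object name `PivotTargetCounts`, the helper `pivotCounts`, the column names `count_target_1` and `count_target_2`, and the local `SparkSession` settings are all chosen here for illustration and do not come from the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Sketch: pivot restricted to target values 1 and 2, with descriptive column names.
// `PivotTargetCounts`, `pivotCounts` and the output column names are hypothetical.
object PivotTargetCounts {

  def pivotCounts(df: DataFrame): DataFrame =
    df.groupBy("name")
      .pivot("target", Seq(1, 2)) // other target values (e.g. 20) get no column
      .count()
      .na.fill(0)                 // a name with no rows for a target gets 0, not null
      .withColumnRenamed("1", "count_target_1")
      .withColumnRenamed("2", "count_target_2")

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; settings are assumptions.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("pivot-target-counts")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("ABCD", 1), ("XYZA", 1), ("GFFD", 1), ("NAAS", 1),
      ("ABCD", 2), ("XYZA", 2), ("NAAS", 2), ("VDDE", 2),
      ("EXAMPLE", 20)
    ).toDF("name", "target")

    pivotCounts(df).show(false)
    spark.stop()
  }
}
```

Listing the values up front also lets Spark skip the extra pass it otherwise needs to discover the distinct values of `target`. Names that only have other target values, such as EXAMPLE, still appear as a row with 0 in both columns.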

Solution 1
1 2020-01-29 20:36:34

Solution 2
1 2020-01-30 15:21:01
