Convert a DataFrame to a transformed DataFrame in Spark Scala
So, I have a DataFrame in Spark that looks like this:
[name,target] this is the header
[ABCD,1]
[XYZA,1]
[GFFD,1]
[NAAS,1]
[ABCD,2]
[XYZA,2]
[NAAS,2]
[VDDE,2]
I want to transform it into a DataFrame like this:
[name, count(target=1), count(target=2)]
[ABCD, 1,1]
[XYZA, 1,1]
[GFFD, 1,0]
and so on...
Is there a way to do this?
Here are two possible solutions.
Sample input data:
import spark.implicits._
val df = Seq(
("ABCD",1),
("XYZA",1),
("GFFD",1),
("NAAS",1),
("ABCD",2),
("XYZA",2),
("NAAS",2),
("VDDE",2),
("EXAMPLE", 20)
).toDF("name", "target")
df.show()
+-------+------+
| name|target|
+-------+------+
| ABCD| 1|
| XYZA| 1|
| GFFD| 1|
| NAAS| 1|
| ABCD| 2|
| XYZA| 2|
| NAAS| 2|
| VDDE| 2|
|EXAMPLE| 20|
+-------+------+
1 - Using a map
Only non-zero occurrences are returned.
case class DataItem(name: String, target: Int)
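df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups {
    case (nameKey, targetIter) =>
      // Collect all target values seen for this name
      val targetList = targetIter.map(_.target).toSeq
      // Count occurrences per target value; a strict map (rather than mapValues) behaves the same on Scala 2.12 and 2.13
      val occMap = targetList.groupBy(identity).map { case (t, occ) => t -> occ.size }
      (nameKey, occMap)
  }
  .toDF("name", "target_count").show()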
+-------+----------------+
| name| target_count|
+-------+----------------+
| VDDE| [2 -> 1]|
| NAAS|[2 -> 1, 1 -> 1]|
|EXAMPLE| [20 -> 1]|
| GFFD| [1 -> 1]|
| XYZA|[2 -> 1, 1 -> 1]|
| ABCD|[2 -> 1, 1 -> 1]|
+-------+----------------+
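As an alternative sketch, a similar name-to-map result can probably be built with Spark's built-in aggregate functions instead of `mapGroups`. This assumes Spark 2.4 or later, where `map_from_entries` is available. The local `SparkSession` setup and the app name are assumptions added only to keep the snippet self-contained.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_list, map_from_entries, struct}

// Assumption: a local session, only so the sketch runs on its own
val spark = SparkSession.builder().master("local[*]").appName("map-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("ABCD", 1), ("XYZA", 1), ("GFFD", 1), ("NAAS", 1),
  ("ABCD", 2), ("XYZA", 2), ("NAAS", 2), ("VDDE", 2),
  ("EXAMPLE", 20)
).toDF("name", "target")

// Step 1: one row per (name, target) pair with its occurrence count
val perTarget = df.groupBy("name", "target").count()

// Step 2: fold the (target, count) pairs of each name into a single map column
val asMap = perTarget
  .groupBy("name")
  .agg(map_from_entries(collect_list(struct($"target", $"count"))).as("target_count"))

asMap.show(false)
```

Keeping the work inside built-in functions avoids deserializing every row into a `DataItem` object, which may matter on larger inputs.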
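2 - Using a list that shows the occurrence counts (including 0), where the i-th element (counting from 1) holds the count for target value i.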
case class DataItem(name: String, target: Int)
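df.as[DataItem]
  .groupByKey(_.name)
  .mapGroups {
    case (nameKey, targetIter) =>
      val targetList = targetIter.map(_.target).toSeq
      // Count occurrences per target value
      val occMap = targetList.groupBy(identity).map { case (t, occ) => t -> occ.size }
      // Largest target value seen for this name (the largest key, not the key with the largest count)
      val maxTarget = occMap.keys.max
      // One entry per target value from 1 to maxTarget, with 0 for values that never occur
      val occList = for (i <- 1 to maxTarget) yield occMap.getOrElse(i, 0)
      (nameKey, occList)
  }
  .toDF("name", "target_count").show(20, false)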
+-------+------------------------------------------------------------+
|name |target_count |
+-------+------------------------------------------------------------+
|VDDE |[0, 1] |
|NAAS |[1, 1] |
|EXAMPLE|[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]|
|GFFD |[1] |
|XYZA |[1, 1] |
|ABCD |[1, 1] |
+-------+------------------------------------------------------------+
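When the target values of interest are known up front (assumed here to be 1 and 2, as in the question), the zero-filled counts can also be sketched with conditional aggregation, which yields one fixed column per target value rather than a variable-length list. The column names `count_target_1` and `count_target_2` and the local session setup are illustrative assumptions, not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, sum, when}

// Assumption: a local session, only so the sketch runs on its own
val spark = SparkSession.builder().master("local[*]").appName("cond-agg-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("ABCD", 1), ("XYZA", 1), ("GFFD", 1), ("NAAS", 1),
  ("ABCD", 2), ("XYZA", 2), ("NAAS", 2), ("VDDE", 2),
  ("EXAMPLE", 20)
).toDF("name", "target")

// Each row contributes 1 to the matching column and 0 to the other,
// so names without a given target value end up with a count of 0.
val counts = df.groupBy("name").agg(
  sum(when(col("target") === 1, 1).otherwise(0)).as("count_target_1"),
  sum(when(col("target") === 2, 1).otherwise(0)).as("count_target_2")
)

counts.show()
```

With this input, a name such as EXAMPLE, whose only target is 20, still appears in the result with 0 in both columns.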
The DataFrame can be transformed with "pivot":
df
.groupBy("name")
.pivot("target")
.count()
// replace nulls with 0
.na.fill(0)
Using the data provided by Cesar A. Mostacero, the result is:
+-------+---+---+---+
|name |1 |2 |20 |
+-------+---+---+---+
|EXAMPLE|0 |0 |1 |
|XYZA |1 |1 |0 |
|GFFD |1 |0 |0 |
|VDDE |0 |1 |0 |
|ABCD |1 |1 |0 |
|NAAS |1 |1 |0 |
+-------+---+---+---+
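To get only the two columns asked for in the question, `pivot` also accepts an explicit list of values. The sketch below assumes that only targets 1 and 2 matter and renames the pivoted columns to the hypothetical names `count_target_1` and `count_target_2`. The local session setup is likewise an assumption added for self-containment.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session, only so the sketch runs on its own
val spark = SparkSession.builder().master("local[*]").appName("pivot-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("ABCD", 1), ("XYZA", 1), ("GFFD", 1), ("NAAS", 1),
  ("ABCD", 2), ("XYZA", 2), ("NAAS", 2), ("VDDE", 2),
  ("EXAMPLE", 20)
).toDF("name", "target")

val pivoted = df
  .groupBy("name")
  // Listing the values restricts the output columns and saves Spark
  // the extra pass it otherwise needs to discover the distinct values
  .pivot("target", Seq(1, 2))
  .count()
  // Replace nulls (no rows for that target) with 0
  .na.fill(0)
  .withColumnRenamed("1", "count_target_1")
  .withColumnRenamed("2", "count_target_2")

pivoted.show()
```

Target values outside the list (20 in this data) get no column of their own, so EXAMPLE shows 0 in both columns.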