Spark Scala update dataframe
I have the following problem:
val data = Seq(("TIM", "FIRST", "A", 1),
("BIM", "SECOND", "A", 2),
("JIM", "THIRD", "B", 1)).toDF("NAME", "POSITION", "GROUP", "INDEX")
data.show()
data.printSchema()
val title = Seq(("A", "MASTER"), ("B", "TEACHER"),
("C", "STUDENT")).toDF("LETTER", "DEGREE")
title.show()
title.printSchema()
+----+--------+-----+-----+
|NAME|POSITION|GROUP|INDEX|
+----+--------+-----+-----+
| TIM|   FIRST|    A|    1|
| BIM|  SECOND|    A|    2|
| JIM|   THIRD|    B|    1|
+----+--------+-----+-----+
root
|-- NAME: string (nullable = true)
|-- POSITION: string (nullable = true)
|-- GROUP: string (nullable = true)
|-- INDEX: integer (nullable = false)
+------+-------+
|LETTER| DEGREE|
+------+-------+
|     A| MASTER|
|     B|TEACHER|
|     C|STUDENT|
+------+-------+
root
|-- LETTER: string (nullable = true)
|-- DEGREE: string (nullable = true)
//Final result
+----+--------+-------+-----+
|NAME|POSITION|  GROUP|INDEX|
+----+--------+-------+-----+
| TIM|   FIRST| MASTER|    1|
| BIM|  SECOND|      A|    2|
| JIM|   THIRD|TEACHER|    1|
+----+--------+-------+-----+
I tried a few things:
val result = data.withColumn("GROUP", when('INDEX === 1, ???????????))
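Where the question marks are, I tried calling a UDF, but I could not get the current row's GROUP value to pass to the UDF as a parameter. I also tried a select from title where GROUP = LETTER, but nothing worked.
The first dataframe is large; the other one is small.
Is there an elegant way to do this without first joining them and then applying withColumn?
Thanks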
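Use a broadcast join:
// needs spark.implicits._ in scope for the $"..." column syntax
import org.apache.spark.sql.functions.{broadcast, when}
data
  .join(broadcast(title), $"GROUP" === $"LETTER")
  .withColumn("GROUP", when($"INDEX" === 1, $"DEGREE").otherwise($"GROUP"))
  .drop("LETTER", "DEGREE")
  .show()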
+----+--------+-------+-----+
|NAME|POSITION|  GROUP|INDEX|
+----+--------+-------+-----+
| TIM|   FIRST| MASTER|    1|
| BIM|  SECOND|      A|    2|
| JIM|   THIRD|TEACHER|    1|
+----+--------+-------+-----+
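One thing to watch with the join above: it is an inner join, so any row in data whose GROUP has no matching LETTER in title would be dropped. That does not happen with the sample data, but if it can happen in yours, a left join is one way around it. The following is only a sketch under that assumption; the object name, the replaceGroup helper, and the extra "KIM" row are hypothetical and not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{broadcast, coalesce, col, when}

object LeftBroadcastJoinSketch {
  // Replace GROUP with the matching DEGREE where INDEX == 1.
  // The left join keeps rows of `data` that have no match in `title`;
  // coalesce falls back to the original GROUP when DEGREE is null.
  def replaceGroup(data: DataFrame, title: DataFrame): DataFrame =
    data
      .join(broadcast(title), col("GROUP") === col("LETTER"), "left")
      .withColumn(
        "GROUP",
        when(col("INDEX") === 1, coalesce(col("DEGREE"), col("GROUP")))
          .otherwise(col("GROUP")))
      .drop("LETTER", "DEGREE")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("left-broadcast-join-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(
      ("TIM", "FIRST", "A", 1),
      ("BIM", "SECOND", "A", 2),
      ("JIM", "THIRD", "B", 1),
      ("KIM", "FOURTH", "Z", 1) // hypothetical row with no match in title
    ).toDF("NAME", "POSITION", "GROUP", "INDEX")

    val title = Seq(("A", "MASTER"), ("B", "TEACHER"), ("C", "STUDENT"))
      .toDF("LETTER", "DEGREE")

    replaceGroup(data, title).show()
    spark.stop()
  }
}
```

With the inner join the hypothetical KIM row would disappear from the result; with the left join it stays and keeps its original GROUP value.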
You could also collect title into a lookup map, broadcast this map, and use a UDF, but there is really no advantage over the broadcast join.
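For completeness, the map-plus-UDF alternative might look roughly like this. It is a minimal sketch, assuming title is small enough to collect to the driver; the object name, the replaceGroup helper, and the toDegree UDF are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, udf, when}

object BroadcastMapUdfSketch {
  // Collect `title` into a Map on the driver, broadcast it, and look up
  // GROUP inside a UDF. Groups missing from the map keep their value.
  def replaceGroup(spark: SparkSession, data: DataFrame, title: DataFrame): DataFrame = {
    import spark.implicits._

    // Tuple encoders bind columns by position: (LETTER, DEGREE).
    val lookup: Map[String, String] =
      title.select("LETTER", "DEGREE").as[(String, String)].collect().toMap
    val lookupB = spark.sparkContext.broadcast(lookup)

    val toDegree = udf((group: String) => lookupB.value.getOrElse(group, group))

    data.withColumn(
      "GROUP",
      when(col("INDEX") === 1, toDegree(col("GROUP"))).otherwise(col("GROUP")))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("broadcast-map-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(
      ("TIM", "FIRST", "A", 1),
      ("BIM", "SECOND", "A", 2),
      ("JIM", "THIRD", "B", 1)
    ).toDF("NAME", "POSITION", "GROUP", "INDEX")

    val title = Seq(("A", "MASTER"), ("B", "TEACHER"), ("C", "STUDENT"))
      .toDF("LETTER", "DEGREE")

    replaceGroup(spark, data, title).show()
    spark.stop()
  }
}
```

The UDF is opaque to the optimizer, which is one reason the broadcast join is usually the better choice here.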