How to populate a Spark DataFrame column based on another column's value?
Dynamic column selection in Spark (based on another column's value)
Given the following Spark DataFrame:
> df.show()
+---+-----+---+---+---+---+
| id|delay| p1| p2| p3| p4|
+---+-----+---+---+---+---+
|  1|    3|  a|  b|  c|  d|
|  2|    1|  m|  n|  o|  p|
|  3|    2|  q|  r|  s|  t|
+---+-----+---+---+---+---+
How can I dynamically select a column, so that the new column col takes its value from the existing column p{delay}?
> df.withColumn("col", /* ??? */).show()
+---+-----+---+---+---+---+----+
| id|delay| p1| p2| p3| p4| col|
+---+-----+---+---+---+---+----+
|  1|    3|  a|  b|  c|  d|   c| // col = p3
|  2|    1|  m|  n|  o|  p|   m| // col = p1
|  3|    2|  q|  r|  s|  t|   r| // col = p2
+---+-----+---+---+---+---+----+
The simplest solution I can think of is to build an array from the p columns and use delay as the index:
import org.apache.spark.sql.functions.array
// The $"..." syntax needs spark.implicits._ in scope. Array indices are 0-based, hence delay - 1.
df.withColumn("col", array($"p1", $"p2", $"p3", $"p4")($"delay" - 1))
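The array approach above can be sketched end to end without hard-coding the column list. This is only an illustration under a few assumptions: the local SparkSession and its app name are arbitrary, the names pCols and result are made up for the example, and the p columns are assumed to be numbered contiguously from p1 so that position delay - 1 lines up with p{delay}. How an out-of-range delay behaves (null or an error) depends on the Spark version and its ANSI setting, so that is worth checking separately.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col}

// Local session for illustration only; in spark-shell an existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("array-index-sketch").getOrCreate()
import spark.implicits._

// Same data as in the question.
val df = Seq(
  (1, 3, "a", "b", "c", "d"),
  (2, 1, "m", "n", "o", "p"),
  (3, 2, "q", "r", "s", "t")
).toDF("id", "delay", "p1", "p2", "p3", "p4")

// Collect every column named p<number>, sorted numerically (so p10 would come after p9).
// Assumption: the numbering is contiguous and starts at 1.
val pCols = df.columns.filter(_.matches("p\\d+")).sortBy(_.drop(1).toInt)

// Pack the p columns into one array column and index it with delay - 1 (0-based).
val result = df.withColumn("col", array(pCols.map(col): _*)(col("delay") - 1))
result.show()
```

For the sample data this should give the same col values as the expected output in the question (c, m, r).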
Another option is to create a mapping from number to column name, then use foldLeft to update the col column with the corresponding value:
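import org.apache.spark.sql.functions.{lit, when}
// Map each delay value to its column name: 1 -> "p1", ..., 4 -> "p4"
val cols = (1 to 4).map(i => i -> s"p$i")
// Start from a null "col" and overwrite it wherever delay equals k
(cols.foldLeft(df.withColumn("col", lit(null))){
  case (acc, (k, v)) => acc.withColumn("col", when(acc("delay") === k, acc(v)).otherwise(acc("col")))
}).show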
+---+-----+---+---+---+---+---+
| id|delay| p1| p2| p3| p4|col|
+---+-----+---+---+---+---+---+
|  1|    3|  a|  b|  c|  d|  c|
|  2|    1|  m|  n|  o|  p|  m|
|  3|    2|  q|  r|  s|  t|  r|
+---+-----+---+---+---+---+---+
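The same number-to-column mapping can also be written as a single expression instead of a chain of withColumn calls. The sketch below is a variant, not the answer's own code: it builds one when(...) per mapping entry and lets coalesce pick the first non-null branch. The names branches and picked, and the local SparkSession setup, are assumptions made for the example. Rows whose delay is outside 1 to 4 would end up with null in col, since no branch matches.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, when}

// Local session for illustration only; in spark-shell an existing session is reused.
val spark = SparkSession.builder().master("local[*]").appName("coalesce-when-sketch").getOrCreate()
import spark.implicits._

// Same data as in the question.
val df = Seq(
  (1, 3, "a", "b", "c", "d"),
  (2, 1, "m", "n", "o", "p"),
  (3, 2, "q", "r", "s", "t")
).toDF("id", "delay", "p1", "p2", "p3", "p4")

// Same mapping as in the foldLeft version: 1 -> "p1", ..., 4 -> "p4".
val cols = (1 to 4).map(i => i -> s"p$i")

// A when(...) without otherwise(...) yields null when the condition is false,
// so exactly one branch per row can be non-null.
val branches = cols.map { case (k, v) => when(col("delay") === k, col(v)) }

// coalesce returns the first non-null branch, i.e. the value of p{delay}.
val picked = df.withColumn("col", coalesce(branches: _*))
picked.show()
```

Compared with foldLeft, this adds the column in one step, which keeps the query plan smaller when the mapping has many entries.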