Dynamic column selection in Spark (based on the value of another column)

Question

Given the following Spark DataFrame:

> df.show()

+---+-----+---+---+---+---+
| id|delay| p1| p2| p3| p4|
+---+-----+---+---+---+---+
|  1|    3|  a|  b|  c|  d|
|  2|    1|  m|  n|  o|  p|
|  3|    2|  q|  r|  s|  t|
+---+-----+---+---+---+---+

How can I select a column dynamically, so that the new column col holds the value of the existing column p{delay}?

> df.withColumn("col", /* ??? */).show()

+---+-----+---+---+---+---+----+
| id|delay| p1| p2| p3| p4| col|
+---+-----+---+---+---+---+----+
|  1|    3|  a|  b|  c|  d|   c|   // col = p3
|  2|    1|  m|  n|  o|  p|   m|   // col = p1
|  3|    2|  q|  r|  s|  t|   r|   // col = p2
+---+-----+---+---+---+---+----+

Answer 1

The simplest solution I can think of is to collect the p columns into an array and use delay as the index:

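import org.apache.spark.sql.functions.array
// $"..." needs spark.implicits._ (already in scope in spark-shell)

// Array indexing is zero-based, so delay = 1 maps to index 0
df.withColumn("col", array($"p1", $"p2", $"p3", $"p4")($"delay" - 1))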
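For anyone who wants to try this outside spark-shell, here is a minimal self-contained sketch of the same array-index idea. The object name `ArrayIndexSketch`, the helper `selectByDelay`, and the local session settings are my own assumptions for illustration, not part of the original answer. It also assumes the candidate columns are named p1 through pN.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col}

object ArrayIndexSketch {
  // Hypothetical helper: picks p{delay} out of p1..pN via an array lookup.
  def selectByDelay(df: DataFrame, n: Int): DataFrame = {
    // Build the candidate columns in order: p1, p2, ..., pN.
    val pCols = (1 to n).map(i => col(s"p$i"))
    // Indexing an array column with apply is zero-based, hence delay - 1.
    df.withColumn("col", array(pCols: _*)(col("delay") - 1))
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (master and app name are assumptions).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("array-index-sketch")
      .getOrCreate()
    import spark.implicits._

    // Recreate the DataFrame from the question.
    val df = Seq(
      (1, 3, "a", "b", "c", "d"),
      (2, 1, "m", "n", "o", "p"),
      (3, 2, "q", "r", "s", "t")
    ).toDF("id", "delay", "p1", "p2", "p3", "p4")

    selectByDelay(df, 4).show()
    spark.stop()
  }
}
```

One caveat: all columns placed in the array must share a compatible type, and what happens when delay falls outside 1..N (a null result or an error) depends on the Spark version and its ANSI settings, so it is worth checking on your own cluster.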

Answer 2

One option is to create a mapping from numbers to column names, then use foldLeft to update the col column with the corresponding value:

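import org.apache.spark.sql.functions.{lit, when}

// Map each delay value to its column name: 1 -> "p1", ..., 4 -> "p4"
val cols = (1 to 4).map(i => i -> s"p$i")

// Start from a null col, then overwrite it wherever delay equals k
(cols.foldLeft(df.withColumn("col", lit(null))){
   case (acc, (k, v)) => acc.withColumn("col", when(acc("delay") === k, acc(v)).otherwise(acc("col")))
}).show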
+---+-----+---+---+---+---+---+    
| id|delay| p1| p2| p3| p4|col|
+---+-----+---+---+---+---+---+
|  1|    3|  a|  b|  c|  d|  c|
|  2|    1|  m|  n|  o|  p|  m|
|  3|    2|  q|  r|  s|  t|  r|
+---+-----+---+---+---+---+---+
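The fold above calls withColumn once per candidate column. A possible variation, sketched below, folds over a Column expression instead and adds the column in a single withColumn call. This is not from the original answer: the object name `WhenChainSketch`, the helper `selectByDelay`, and the cast of the initial null to string are assumptions made for the example.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, lit, when}

object WhenChainSketch {
  // Hypothetical helper: builds one nested when/otherwise expression for p1..pN.
  def selectByDelay(df: DataFrame, n: Int): DataFrame = {
    // Fallback used when delay matches none of 1..N (assumes string-typed p columns).
    val fallback: Column = lit(null).cast("string")
    // Each step wraps the previous expression: when(delay = i, p_i) otherwise <previous>.
    val expr = (1 to n).foldLeft(fallback) { (acc, i) =>
      when(col("delay") === i, col(s"p$i")).otherwise(acc)
    }
    df.withColumn("col", expr)
  }

  def main(args: Array[String]): Unit = {
    // Local session for illustration only (master and app name are assumptions).
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("when-chain-sketch")
      .getOrCreate()
    import spark.implicits._

    // Recreate the DataFrame from the question.
    val df = Seq(
      (1, 3, "a", "b", "c", "d"),
      (2, 1, "m", "n", "o", "p"),
      (3, 2, "q", "r", "s", "t")
    ).toDF("id", "delay", "p1", "p2", "p3", "p4")

    selectByDelay(df, 4).show()
    spark.stop()
  }
}
```

Rows whose delay does not match any of 1..N end up with a null in col, because the innermost otherwise branch is the null fallback.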

Answer 1
8, accepted, 2017-09-22 14:02:22

Answer 2
1, 2017-09-22 13:56:58
