Spark: How to collect a single column from an ArrayType of columns to a different array?
(Get the elements from different ArrayType columns and build a column of heterogeneous data in Spark)
I have a Spark dataframe parsed from an XML file, with data in the following format:
+---------+------------------------------------------------------------------------------------------+----------------------------+------------------------------------------------+
|id |a |b |c |
+---------+------------------------------------------------------------------------------------------+----------------------------+------------------------------------------------+
|191683250|[52396062, 55064266, 51149167, 53441347, 51309543, 51517728, 51543627, 68138995, 70180065]|[2, 2, 1, 3, 3, 2, 2, 27, 1]|[1.15, 0.8, 4.0, 2.49, 1.0, 2.8, 0.4, 0.49, 2.0]|
+---------+------------------------------------------------------------------------------------------+----------------------------+------------------------------------------------+
I need the output data in the following format:
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |a |
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|191683250|Array[(52396062,2,1.15), (55064266,2,0.8), (51149167,1,4.0), (53441347,3,2.49), (51309543,3,1.0), (51517728,2,2.8), (51543627,2,0.4), (68138995,27,0.49), (70180065,1,2.0)]|
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
That is, I need an array of StructTypes / tuples. I'm stuck on how to go about this.
Could you point me to how to achieve this in Spark using Scala? Any help is appreciated.
In Spark >= 2.4 this can be solved with the arrays_zip function:
import org.apache.spark.sql.functions.arrays_zip

val df = // Example dataframe in question
val df2 = df.withColumn("a", arrays_zip($"a", $"b", $"c"))
  .drop("b", "c")
For older versions of Spark, use a UDF instead:
import org.apache.spark.sql.functions.udf

// Zip the three sequences element-wise, then flatten the nested pairs into triples
val convertToArray = udf((a: Seq[Int], b: Seq[Int], c: Seq[Double]) => {
  a zip b zip c map { case ((a, b), c) => (a, b, c) }
})

val df = // Example dataframe in question
val df2 = df.withColumn("a", convertToArray($"a", $"b", $"c"))
  .drop("b", "c")
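The zip-of-zips pattern inside the UDF can be checked in plain Scala, outside Spark. A minimal sketch, using the first three elements of the example row as sample values:

```scala
// `a zip b zip c` yields ((a, b), c) pairs; the pattern match
// flattens each nested pair into an (a, b, c) triple.
val a = Seq(52396062, 55064266, 51149167)
val b = Seq(2, 2, 1)
val c = Seq(1.15, 0.8, 4.0)

val zipped = a zip b zip c map { case ((a, b), c) => (a, b, c) }
// List((52396062,2,1.15), (55064266,2,0.8), (51149167,1,4.0))
```

The same expression is what the UDF evaluates per row.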
Resulting dataframe:
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |a |
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|191683250|[[52396062,2,1.15], [55064266,2,0.8], [51149167,1,4.0], [53441347,3,2.49], [51309543,3,1.0], [51517728,2,2.8], [51543627,2,0.4], [68138995,27,0.49], [70180065,1,2.0]]|
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
This answer is not as elegant as @Shaido's; it just shows another possible way of doing it.
df.select($"id",
  array(struct($"a"(0), $"b"(0), $"c"(0)),
        struct($"a"(1), $"b"(1), $"c"(1)),
        struct($"a"(2), $"b"(2), $"c"(2)),
        struct($"a"(3), $"b"(3), $"c"(3)),
        struct($"a"(4), $"b"(4), $"c"(4)),
        struct($"a"(5), $"b"(5), $"c"(5)),
        struct($"a"(6), $"b"(6), $"c"(6)),
        struct($"a"(7), $"b"(7), $"c"(7)),
        struct($"a"(8), $"b"(8), $"c"(8))).as("a"))
  .show(false)
You should get:
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id       |a                                                                                                                                                                     |
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|191683250|[[52396062,2,1.15], [55064266,2,0.8], [51149167,1,4.0], [53441347,3,2.49], [51309543,3,1.0], [51517728,2,2.8], [51543627,2,0.4], [68138995,27,0.49], [70180065,1,2.0]]|
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
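One caveat with the hard-coded indices above: they throw if any array is shorter than expected, whereas the zip-based approaches simply truncate to the shortest input. A plain-Scala sketch with illustrative values (not from the question):

```scala
val a = Seq(1, 2, 3)
val b = Seq(10, 20)   // shorter than a
val c = Seq(0.5, 1.5) // shorter than a

// zip truncates to the shortest sequence instead of throwing
val zipped = a zip b zip c map { case ((a, b), c) => (a, b, c) }
// List((1,10,0.5), (2,20,1.5))

// by contrast, index-based access such as b(2) would throw
// IndexOutOfBoundsException here
```

So the struct-per-index variant is only safe when every row is known to have arrays of exactly the expected length.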