Get the elements from different ArrayType columns and build a column with heterogeneous data in Spark
I have a Spark dataframe, parsed from an XML file, with data in the following format:
+---------+------------------------------------------------------------------------------------------+----------------------------+------------------------------------------------+
|id |a |b |c |
+---------+------------------------------------------------------------------------------------------+----------------------------+------------------------------------------------+
|191683250|[52396062, 55064266, 51149167, 53441347, 51309543, 51517728, 51543627, 68138995, 70180065]|[2, 2, 1, 3, 3, 2, 2, 27, 1]|[1.15, 0.8, 4.0, 2.49, 1.0, 2.8, 0.4, 0.49, 2.0]|
+---------+------------------------------------------------------------------------------------------+----------------------------+------------------------------------------------+
I need the output data in this format:
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |a |
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|191683250|Array[(52396062,2,1.15), (55064266,2,0.8), (51149167,1,4.0), (53441347,3,2.49), (51309543,3,1.0), (51517728,2,2.8), (51543627,2,0.4), (68138995,27,0.49), (70180065,1,2.0)]|
+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
That is, I need an array of structs/tuples, where each element combines the values at the same position in a, b and c. I'm stuck on how to proceed.
Could you please point me to how I can achieve this in Spark using Scala? I'd appreciate any help.
In Spark >= 2.4 this can be solved using the arrays_zip function:
import org.apache.spark.sql.functions.arrays_zip
// df is the example dataframe from the question; $"..." needs spark.implicits._ in scope
val df2 = df.withColumn("a", arrays_zip($"a", $"b", $"c"))
  .drop("b", "c")
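The snippet above leaves `df` as a placeholder. A minimal end-to-end sketch follows, assuming a local SparkSession and a shortened, three-element version of the arrays from the question. The object name `ArraysZipSketch`, the helper `zipColumns` and the app name are made up for illustration and are not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{arrays_zip, col}

object ArraysZipSketch {
  // Hypothetical helper: replaces column "a" with the zip of a, b and c,
  // then drops the now-redundant b and c columns.
  def zipColumns(df: DataFrame): DataFrame =
    df.withColumn("a", arrays_zip(col("a"), col("b"), col("c")))
      .drop("b", "c")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("arrays-zip-sketch")
      .getOrCreate()
    import spark.implicits._

    // Shortened stand-in for the dataframe in the question
    // (only the first three elements of each array).
    val df = Seq(
      (191683250, Seq(52396062, 55064266, 51149167), Seq(2, 2, 1), Seq(1.15, 0.8, 4.0))
    ).toDF("id", "a", "b", "c")

    val df2 = zipColumns(df)
    df2.printSchema() // column a should now be an array of structs
    df2.show(false)

    spark.stop()
  }
}
```

Using `col("a")` instead of `$"a"` inside the helper keeps it independent of `spark.implicits._`, which is only needed here for `toDF`.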
For older versions of Spark, use a UDF:
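import org.apache.spark.sql.functions.udf
// Zip the three sequences element-wise and flatten the nested pairs into triples
val convertToArray = udf((a: Seq[Int], b: Seq[Int], c: Seq[Double]) => {
  a zip b zip c map { case ((a, b), c) => (a, b, c) }
})
// df is the example dataframe from the question
val df2 = df.withColumn("a", convertToArray($"a", $"b", $"c"))
  .drop("b", "c")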
The resulting dataframe:
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|id |a |
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|191683250|[[52396062,2,1.15], [55064266,2,0.8], [51149167,1,4.0], [53441347,3,2.49], [51309543,3,1.0], [51517728,2,2.8], [51543627,2,0.4], [68138995,27,0.49], [70180065,1,2.0]]|
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------+
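The body of the UDF above is plain Scala collection code, so it can be tried without Spark. The sketch below pulls it out into a standalone function to show what the nested `zip` produces and how it behaves when the inputs differ in length. The names `ZipSketch` and `zip3` are hypothetical, and the shortened inputs are just the first three elements from the question.

```scala
object ZipSketch {
  // Same logic as the UDF body: zip pairs up a and b, the second zip
  // nests that pair with c, and map flattens ((a, b), c) into (a, b, c).
  def zip3(a: Seq[Int], b: Seq[Int], c: Seq[Double]): Seq[(Int, Int, Double)] =
    a.zip(b).zip(c).map { case ((x, y), z) => (x, y, z) }

  def main(args: Array[String]): Unit = {
    val a = Seq(52396062, 55064266, 51149167)
    val b = Seq(2, 2, 1)
    val c = Seq(1.15, 0.8, 4.0)

    // Three triples, one per position.
    println(zip3(a, b, c))

    // Scala's zip stops at the shortest input, so only one triple survives here.
    println(zip3(a, b, Seq(1.15)))
  }
}
```

Because `zip` truncates to the shortest sequence, the UDF silently drops trailing elements if the three arrays in a row have different lengths, which is worth keeping in mind for data parsed from XML.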
This answer is not as good as @Shaido's answer; it just shows another way of doing it, by indexing each array position explicitly. Note that the snippet only covers indices 0 to 7, so the ninth element of each array is left out of the result.
import org.apache.spark.sql.functions.{array, struct}
// Build one struct per array position, then wrap the structs in a single array column
df.select($"id",
  array(struct($"a"(0), $"b"(0), $"c"(0)),
        struct($"a"(1), $"b"(1), $"c"(1)),
        struct($"a"(2), $"b"(2), $"c"(2)),
        struct($"a"(3), $"b"(3), $"c"(3)),
        struct($"a"(4), $"b"(4), $"c"(4)),
        struct($"a"(5), $"b"(5), $"c"(5)),
        struct($"a"(6), $"b"(6), $"c"(6)),
        struct($"a"(7), $"b"(7), $"c"(7))).as("a"))
  .show(false)
You should get:
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|id |a |
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------+
|191683250|[[52396062,2,1.15], [55064266,2,0.8], [51149167,1,4.0], [53441347,3,2.49], [51309543,3,1.0], [51517728,2,2.8], [51543627,2,0.4], [68138995,27,0.49]]|
+---------+----------------------------------------------------------------------------------------------------------------------------------------------------+
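The hard-coded indices above can also be generated from a range, which avoids writing one line per element and makes it harder to leave a position out. This is only a sketch, assuming every array in the column has the same length `n` and that this length is known up front. `IndexedStructSketch`, `zipByIndex` and `n` are illustrative names, not part of the original answer, and positions beyond an array's actual length are not handled here.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, struct}

object IndexedStructSketch {
  // Builds one struct per index in 0 until n and wraps them in an array column.
  // Assumption: all arrays in columns a, b and c hold at least n elements.
  def zipByIndex(df: DataFrame, n: Int): DataFrame = {
    val structs = (0 until n).map(i => struct(col("a")(i), col("b")(i), col("c")(i)))
    df.select(col("id"), array(structs: _*).as("a"))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for trying this out.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("indexed-struct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Shortened stand-in for the dataframe in the question.
    val df = Seq(
      (191683250, Seq(52396062, 55064266, 51149167), Seq(2, 2, 1), Seq(1.15, 0.8, 4.0))
    ).toDF("id", "a", "b", "c")

    zipByIndex(df, n = 3).show(false)

    spark.stop()
  }
}
```

For the dataframe in the question, `n` would be 9 to cover every element.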