
How to find the element-wise sum of arrays in one column, grouped by the values of another column, in a Spark DataFrame using Scala

I have a DataFrame like the one below:

c1             Value
A             Array[47,97,33,94,6]
A             Array[59,98,24,83,3]
A             Array[77,63,93,86,62]
B             Array[86,71,72,23,27]
B             Array[74,69,72,93,7]
B             Array[58,99,90,93,41]
C             Array[40,13,85,75,90]
C             Array[39,13,33,29,14]
C             Array[99,88,57,69,49]

I need the following output:

c1             Value
A             Array[183,258,150,263,71]
B             Array[218,239,234,209,75]
C             Array[178,114,175,173,153]

In other words, I need to group by column c1 and sum the arrays in column Value element by element (position by position). Please help, I couldn't find a way to do this by searching on Google.

It is not very complicated. As you mention, you can simply group by "c1" and aggregate the values of the arrays index by index.
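
To make "index by index" concrete, here is a minimal sketch of the same logic on plain Scala collections, without Spark. It uses the rows from the question; the names `rows` and `summed` are only illustrative, and it assumes all arrays within a group have the same length.

```scala
// Rows from the question, as plain Scala collections (no Spark involved).
val rows: Seq[(String, Seq[Int])] = Seq(
  "A" -> Seq(47, 97, 33, 94, 6),
  "A" -> Seq(59, 98, 24, 83, 3),
  "A" -> Seq(77, 63, 93, 86, 62),
  "B" -> Seq(86, 71, 72, 23, 27),
  "B" -> Seq(74, 69, 72, 93, 7),
  "B" -> Seq(58, 99, 90, 93, 41),
  "C" -> Seq(40, 13, 85, 75, 90),
  "C" -> Seq(39, 13, 33, 29, 14),
  "C" -> Seq(99, 88, 57, 69, 49)
)

// Group by key, then sum the arrays position by position.
// transpose turns 3 arrays of 5 values into 5 columns of 3 values,
// so summing each column gives the element-wise total.
val summed: Map[String, Seq[Int]] = rows
  .groupBy(_._1)
  .map { case (key, group) => key -> group.map(_._2).transpose.map(_.sum) }

// Print the groups in key order.
summed.toSeq.sortBy(_._1).foreach { case (key, values) =>
  println(s"$key ${values.mkString("[", ",", "]")}")
}
// A [183,258,150,263,71]
// B [218,239,234,209,75]
// C [178,114,175,173,153]
```

The Spark version follows the same idea, except that each position of the array becomes its own `sum` aggregate.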

Let's first generate some data:

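import org.apache.spark.sql.functions._
import spark.implicits._ // already in scope in spark-shell

// Six rows with keys 0, 1, 2, each holding an array of five random digits
val df = spark.range(6)
    .select('id % 3 as "c1",
            array((1 to 5).map(_ => floor(rand() * 10)) : _*) as "Value")
df.show()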
+---+---------------+
| c1|          Value|
+---+---------------+
|  0|[7, 4, 7, 4, 0]|
|  1|[3, 3, 2, 8, 5]|
|  2|[2, 1, 0, 4, 4]|
|  0|[0, 4, 2, 1, 8]|
|  1|[1, 5, 7, 4, 3]|
|  2|[2, 5, 0, 2, 2]|
+---+---------------+

Then we iterate over the array indices and build one sum aggregate per position. It is very similar to the way we created the arrays:

val n = 5 // if you know the size of the arrays
// If you do not, read it from the first row instead (assumes all arrays have the same length):
// val n = df.select(size('Value)).first.getAs[Int](0)
df
    .groupBy("c1")
    .agg(array((0 until n).map(i => sum(col("Value").getItem(i))) :_* ) as "Value")
    .show()
+---+----------------+
| c1|           Value|
+---+----------------+
|  0| [7, 8, 9, 5, 8]|
|  1|[4, 8, 9, 12, 8]|
|  2| [4, 6, 0, 6, 6]|
+---+----------------+
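
For reference, the same aggregation can be applied to the data from the question. This is only a sketch under a few assumptions: the names `input` and `result` are illustrative, the array length is fixed at 5 as in the question, and a local SparkSession is created for standalone use (in spark-shell, `spark` and its implicits already exist, so those lines can be skipped).

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, sum}

// Assumption: a local session for a standalone run; spark-shell already provides `spark`.
val spark = SparkSession.builder().master("local[*]").appName("array-sum").getOrCreate()
import spark.implicits._

// The rows from the question.
val input = Seq(
  ("A", Seq(47, 97, 33, 94, 6)),
  ("A", Seq(59, 98, 24, 83, 3)),
  ("A", Seq(77, 63, 93, 86, 62)),
  ("B", Seq(86, 71, 72, 23, 27)),
  ("B", Seq(74, 69, 72, 93, 7)),
  ("B", Seq(58, 99, 90, 93, 41)),
  ("C", Seq(40, 13, 85, 75, 90)),
  ("C", Seq(39, 13, 33, 29, 14)),
  ("C", Seq(99, 88, 57, 69, 49))
).toDF("c1", "Value")

val n = 5 // every array in the question has five elements

// One sum aggregate per array position, packed back into a single array column.
val result = input
  .groupBy("c1")
  .agg(array((0 until n).map(i => sum(col("Value").getItem(i))): _*) as "Value")
  .orderBy("c1") // groupBy gives no ordering guarantee, so sort for display

result.show(false)
// A -> [183, 258, 150, 263, 71]
// B -> [218, 239, 234, 209, 75]
// C -> [178, 114, 175, 173, 153]
```

Note that `sum` over an integer column returns a long, so the resulting array holds longs rather than ints.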


 