
How to find the mean value of an array column and then subtract the mean from each element in a PySpark DataFrame?

This is a DataFrame in PySpark:

id  list1         list2
1   [10, 20, 30]  [30, 40, 50]
2   [35, 65, 85]  [15, 5, 45]

This is the desired output: calculate the mean of each list and subtract it from every element of that list. I'm using PySpark for this.

id  list1                        list2
1   [10-mean, 20-mean, 30-mean]  [30-mean, 40-mean, 50-mean]
2   [35-mean, 65-mean, 85-mean]  [15-mean, 5-mean, 45-mean]

You can use aggregate to calculate the mean of each list, then use transform on the array columns to subtract that mean from each element (both are higher-order SQL functions available since Spark 2.4):

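from pyspark.sql import functions as F

# df is the DataFrame from the question, with array columns list1 and list2.
df1 = (
    df
    # Sum each array with aggregate(), then divide by its size to get the mean.
    .withColumn("list1_avg", F.expr("aggregate(list1, bigint(0), (acc, x) -> acc + x, acc -> acc / size(list1))"))
    .withColumn("list2_avg", F.expr("aggregate(list2, bigint(0), (acc, x) -> acc + x, acc -> acc / size(list2))"))
    # Subtract each row's mean from every element with transform().
    .withColumn("list1", F.expr("transform(list1, x -> x - list1_avg)"))
    .withColumn("list2", F.expr("transform(list2, x -> x - list2_avg)"))
    # Drop the helper columns once the arrays have been centered.
    .drop("list1_avg", "list2_avg")
)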

df1.show(truncate=False)

#+---+-------------------------------------------------------------+-------------------------------------------------------------+
#|id |list1                                                        |list2                                                        |
#+---+-------------------------------------------------------------+-------------------------------------------------------------+
#|1  |[-10.0, 0.0, 10.0]                                           |[-10.0, 0.0, 10.0]                                           |
#|2  |[-26.666666666666664, 3.3333333333333357, 23.333333333333336]|[-6.666666666666668, -16.666666666666668, 23.333333333333332]|
#+---+-------------------------------------------------------------+-------------------------------------------------------------+
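
As a sanity check on the numbers above, the same arithmetic can be sketched in plain Python without Spark. This is only an illustration of the per-row computation, not part of the Spark solution; the helper name center and the rows list are hypothetical and simply mirror the sample data from the question.

```python
# Plain-Python sketch of the per-row computation (no Spark required).
# `center` is a hypothetical helper name, not a PySpark function.

def center(values):
    """Return the values with their arithmetic mean subtracted."""
    mean = sum(values) / len(values)  # assumes a non-empty list
    return [v - mean for v in values]

# Sample rows copied from the question: (id, list1, list2).
rows = [
    (1, [10, 20, 30], [30, 40, 50]),
    (2, [35, 65, 85], [15, 5, 45]),
]

centered = [(row_id, center(l1), center(l2)) for row_id, l1, l2 in rows]
for row in centered:
    print(row)
```

Each centered list sums to zero up to floating-point rounding, which is a quick way to confirm the Spark output. Note that this sketch does not handle empty lists.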
