I am working with a DataFrame that contains a column whose values are lists:

id | values
1  | ['good','good','good','bad','bad','good','good']
2  | ['bad','badd','good','bad',Null,'good','bad']
....

How can I get the most frequent string in each list? Expected output:

id | most_frequent
1  | 'good'
2  | 'bad'
....
I don't see a reason to explode and groupBy here (both are compute-intensive shuffle operations), because with Spark 2.4+ we can use higher-order functions to get your desired output:
from pyspark.sql import functions as F

df.withColumn(
    "most_common",
    F.expr("""sort_array(transform(array_distinct(values),
               x -> array(aggregate(values, 0, (acc, t) -> acc + IF(t = x, 1, 0)), x)),
              False)[0][1]""")
).show(truncate=False)
#+---+----------------------------------------+-----------+
#|id |values |most_common|
#+---+----------------------------------------+-----------+
#|1 |[good, good, good, bad, bad, good, good]|good |
#|2 |[bad, badd, good, bad,, good, bad] |bad |
#+---+----------------------------------------+-----------+
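To see what the SQL expression computes per row, here is a plain-Python sketch of the same logic (`most_common` and the use of `collections.Counter` are my own illustration, not part of the Spark API; Spark evaluates the expression natively on each array):

```python
from collections import Counter

def most_common(values):
    # Mirror of the Spark expression: count each distinct value
    # (nulls never match `t = x`, so they cannot win), then take
    # the pair with the highest (count, value) -- the same pair
    # that sort_array(..., False)[0] puts first.
    counts = Counter(v for v in values if v is not None)
    return max(counts.items(), key=lambda kv: (kv[1], kv[0]))[0]

print(most_common(['good', 'good', 'good', 'bad', 'bad', 'good', 'good']))  # good
print(most_common(['bad', 'badd', 'good', 'bad', None, 'good', 'bad']))     # bad
```

Note that on a tie this picks the lexicographically largest string, matching how `sort_array` orders the `[count, value]` pairs in descending order.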
We can also use array_max instead of sort_array:
from pyspark.sql import functions as F

df.withColumn(
    "most_common",
    F.expr("""array_max(transform(array_distinct(values),
               x -> array(aggregate(values, 0, (acc, t) -> acc + IF(t = x, 1, 0)), x)))[1]""")
).show(truncate=False)