
In PySpark, get the most frequent string from a column with lists of strings

I'm working with a DataFrame that contains a column whose values are lists:

id    |   values
1     |   ['good','good','good','bad','bad','good','good']
2     |   ['bad','badd','good','bad',Null,'good','bad']
....

How could I get the most frequently occurring string in each list? Expected output:

id   | most_frequent
1    | 'good'
2    | 'bad'
....

I don't see a reason to explode and group by here (both are compute-intensive shuffle operations); with Spark 2.4+, we can use higher-order functions to get your desired output:

from pyspark.sql import functions as F

# For each distinct element x, build the pair [count_of_x, x], sort the pairs
# in descending order, and take the string from the top pair. (array() coerces
# the count to string, so counts compare lexicographically; fine for small
# lists like these, but e.g. 4 vs. 10 could be misordered.)
df.withColumn(
    "most_common",
    F.expr("""sort_array(
                transform(array_distinct(values),
                          x -> array(aggregate(values, 0, (acc, t) -> acc + IF(t = x, 1, 0)), x)),
                False)[0][1]"""),
).show(truncate=False)

#+---+----------------------------------------+-----------+
#|id |values                                  |most_common|
#+---+----------------------------------------+-----------+
#|1  |[good, good, good, bad, bad, good, good]|good       |
#|2  |[bad, badd, good, bad,, good, bad]      |bad        |
#+---+----------------------------------------+-----------+
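For anyone reproducing this, the sample DataFrame above can be built roughly like this (a minimal sketch; the exact schema of the original data is an assumption, with Python's None standing in for the Null element):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical reconstruction of the question's data: an int id column and an
# array<string> values column, with None in place of the Null element.
df = spark.createDataFrame(
    [(1, ["good", "good", "good", "bad", "bad", "good", "good"]),
     (2, ["bad", "badd", "good", "bad", None, "good", "bad"])],
    ["id", "values"],
)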

We can also use array_max instead of sort_array:

from pyspark.sql import functions as F

# array_max picks the largest [count, x] pair directly, so no explicit sort is
# needed; [1] then extracts the string from that pair.
df.withColumn(
    "most_common",
    F.expr("""array_max(
                transform(array_distinct(values),
                          x -> array(aggregate(values, 0, (acc, t) -> acc + IF(t = x, 1, 0)), x)))[1]"""),
).show(truncate=False)
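For comparison, or on Spark versions before 2.4, the explode-and-group-by approach dismissed above would look roughly like this (a sketch; the column names value and rn are my own). It produces the same result but incurs the shuffles the higher-order-function version avoids, and ties are broken arbitrarily by row_number:

from pyspark.sql import functions as F, Window

# Count occurrences of each value per id (this triggers a shuffle).
counts = (df
          .select("id", F.explode("values").alias("value"))
          .groupBy("id", "value")
          .count())

# Keep only the top-counted value per id; ties are broken arbitrarily.
w = Window.partitionBy("id").orderBy(F.desc("count"))
(counts
 .withColumn("rn", F.row_number().over(w))
 .filter("rn = 1")
 .select("id", F.col("value").alias("most_frequent"))
 .show())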
