Find the k most frequent words in each row of a PySpark dataframe

Question

I have a Spark dataframe that looks something like this:

columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
        ("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"), 
        ("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
        ("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)
init_df.show(truncate = False)

+------------------+-----------------------------------------------------------+
|object_type       |object_name                                                |
+------------------+-----------------------------------------------------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |
+------------------+-----------------------------------------------------------+

I need to create a new column with the most frequent words from the object_name column using PySpark.
Conditions:

if there is one dominant word in the row (mode = 1), then choose this word as the most frequent (like "andromeda" in the first row)
if there are two dominant words in the row that occur an equal number of times (mode = 2), then select both words (like "mars" and "venus" in the second row: each occurs 3 times, while the rest of the words are less common)
if there are three dominant words in the row that occur an equal number of times, then pick all three words (like "mira", "sun" and "sirius", which each occur 2 times, while the rest of the words occur only once)
if there are four or more dominant words in the row that occur an equal number of times (like in the fourth row), then set the "many objects" flag.
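
The conditions above can be sketched in plain Python, outside Spark, as a reference for what any solution should return. This is only an illustration: the function name `most_frequent_label` and its `max_words` and `flag` parameters are hypothetical, not part of the question.

```python
from collections import Counter

def most_frequent_label(object_name, max_words=3, flag="many objects"):
    """Apply the conditions to one comma-separated string (hypothetical helper)."""
    # Counter keeps keys in first-occurrence order (Python 3.7+).
    counts = Counter(object_name.split(","))
    top = max(counts.values())
    # Every word whose count ties with the maximum is "dominant".
    dominant = [word for word, count in counts.items() if count == top]
    # Four or more tied words collapse to the flag.
    if len(dominant) > max_words:
        return flag
    return ",".join(dominant)

rows = [
    "andromeda,milky way,condor,andromeda",
    "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth",
    "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran",
    "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa",
]
labels = [most_frequent_label(row) for row in rows]
```

For the four sample rows, `labels` matches the most_frequent column of the expected output.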

Expected output:

+------------------+-----------------------------------------------------------+---------------+
|object_type       |object_name                                                |most_frequent  |
+------------------+-----------------------------------------------------------+---------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda      |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|mars,venus     |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |mira,sun,sirius|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |many objects   |
+------------------+-----------------------------------------------------------+---------------+

I'll be very grateful for any advice!

Answer 1

You can try this:

from pyspark.sql import functions as F

# split the string into a list, then keep every word whose count equals the row's maximum count
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"),",")) \
    .withColumn("most_frequent", F.udf(lambda x: ', '.join([mitem[1] for mitem in zip((x.count(item) for item in set(x)),set(x)) if mitem[0] == max((x.count(item) for item in set(x)))]))(F.col("list_obj"))) \
    .drop("list_obj")

res_df.show(truncate=False)

+------------------+-----------------------------------------------------------+---------------------+
|object_type       |object_name                                                |most_frequent        |
+------------------+-----------------------------------------------------------+---------------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda            |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars          |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |sirius, mira, sun    |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |moon, kale, titan, io|
+------------------+-----------------------------------------------------------+---------------------+

EDIT:

According to OP's suggestion, we can achieve their desired output by doing something like this:

from pyspark.sql.types import ArrayType, StringType

res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"),",")) \
    .withColumn("most_frequent", F.udf(lambda x: [mitem[1] for mitem in zip((x.count(item) for item in set(x)),set(x)) if mitem[0] == max((x.count(item) for item in set(x)))], ArrayType(StringType()))(F.col("list_obj"))) \
    .withColumn("most_frequent", F.when(F.size(F.col("most_frequent")) >= 4, F.lit("many objects")).otherwise(F.concat_ws(", ", F.col("most_frequent")))) \
    .drop("list_obj")

res_df.show(truncate=False)

+------------------+-----------------------------------------------------------+-----------------+
|object_type       |object_name                                                |most_frequent    |
+------------------+-----------------------------------------------------------+-----------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda        |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars      |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |sirius, mira, sun|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |many objects     |
+------------------+-----------------------------------------------------------+-----------------+
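
The words in this output come back in `set` order ("venus, mars"), which is arbitrary, while the expected output lists them in order of first appearance ("mars,venus"). A minimal sketch of the same count, zip and max logic as a named function, using `dict.fromkeys` instead of `set` so that first-seen order is kept; the name `top_words_in_order` is hypothetical:

```python
def top_words_in_order(words):
    """Return the most frequent words in order of first appearance."""
    # dict.fromkeys de-duplicates while keeping insertion order (Python 3.7+).
    unique = list(dict.fromkeys(words))
    counts = [words.count(word) for word in unique]
    best = max(counts)  # computed once instead of once per word
    return [word for word, count in zip(unique, counts) if count == best]

planet = "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth".split(",")
star = "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran".split(",")
print(top_words_in_order(planet))  # ['mars', 'venus']
print(top_words_in_order(star))    # ['mira', 'sun', 'sirius']
```

Wrapped as `F.udf(top_words_in_order, ArrayType(StringType()))`, it could presumably stand in for the lambda above, with the `F.when(F.size(...) >= 4, ...)` step left unchanged.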

Answer 2

Try this:

from pyspark.sql import functions as psf
from pyspark.sql.window import Window

columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
        ("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"), 
        ("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
        ("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)

# explode the object names into one row per word and count each word per object type
df_exp = init_df.withColumn('object_name_exp', psf.explode(psf.split('object_name',',')))
df_counts = df_exp.groupBy('object_type', 'object_name_exp').count()

# rank the counts within each object type, highest count first
window_spec = Window.partitionBy('object_type').orderBy(psf.col('count').desc())
df_ranked = df_counts.withColumn('rank', psf.dense_rank().over(window_spec))

# keep only the top ranked object names
df_top_ranked = df_ranked.filter(psf.col('rank')==psf.lit(1)).drop('count')

# count the number of top ranked object names
df_top_counts = df_top_ranked.groupBy('object_type',  'rank').count()

# join these back to the original object names
df_with_counts = df_top_ranked.join(df_top_counts, on='object_type', how='inner')

# implement the rules whether to retain the reference to the object name or state 'many objects'
df_most_freq = df_with_counts.withColumn('most_frequent'
    , psf.when(psf.col('count')<=psf.lit(3), psf.col('object_name_exp')).otherwise(psf.lit('many objects'))
    )

# collect the retained object names back into an array and de-duplicate them
df_results = df_most_freq.groupBy('object_type').agg(psf.array_distinct(psf.collect_list('most_frequent')).alias('most_frequent'))

# show output                                                     
df_results.show()

+------------------+-------------------+
|       object_type|      most_frequent|
+------------------+-------------------+
|            galaxy|        [andromeda]|
|natural satellites|     [many objects]|
|            planet|      [mars, venus]|
|              star|[sirius, mira, sun]|
+------------------+-------------------+
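
The filter on rank 1 keeps all tied words because `dense_rank` gives equal counts the same rank. A plain-Python sketch of that ranking for the planet row, outside Spark; the helper `dense_rank_desc` is hypothetical and only mimics the window function's behaviour:

```python
def dense_rank_desc(counts):
    """Map each key to its dense rank, with the highest value ranked 1."""
    # Equal values share a rank and ranks have no gaps.
    distinct_values = sorted(set(counts.values()), reverse=True)
    rank_of = {value: rank for rank, value in enumerate(distinct_values, start=1)}
    return {key: rank_of[value] for key, value in counts.items()}

planet_counts = {"mars": 3, "jupiter": 1, "venus": 3, "saturn": 1, "earth": 2}
ranks = dense_rank_desc(planet_counts)
# Keeping rank 1 returns every word tied for the highest count.
top = [name for name, rank in ranks.items() if rank == 1]
print(ranks)  # {'mars': 1, 'jupiter': 3, 'venus': 1, 'saturn': 3, 'earth': 2}
print(top)    # ['mars', 'venus']
```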
