Find the k most frequent words in each row from a PySpark dataframe

I have a Spark dataframe that looks like this:

columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
        ("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"), 
        ("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
        ("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)
init_df.show(truncate=False)

+------------------+-----------------------------------------------------------+
|object_type       |object_name                                                |
+------------------+-----------------------------------------------------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |
+------------------+-----------------------------------------------------------+

I need to use PySpark to create a new column containing the most frequent word(s) from the object_name column.
Conditions:

  • If a row has a single dominant word (mode = 1), select that word as the most frequent (like "andromeda" in the first row).
  • If a row has two dominant words occurring the same number of times (mode = 2), select both words (like "mars" and "venus" in the second row: they each occur 3 times, while the rest of the words are less frequent).
  • If a row has three dominant words occurring equally often, select all three (like "mira", "sun", and "sirius", which each occur twice while the rest occur only once).
  • If a row has four or more dominant words occurring the same number of times (as in the fourth row), set the "many objects" flag instead.

Expected output:

+------------------+-----------------------------------------------------------+---------------+
|object_type       |object_name                                                |most_frequent  |
+------------------+-----------------------------------------------------------+---------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda      |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|mars,venus     |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |mira,sun,sirius|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |many objects   |
+------------------+-----------------------------------------------------------+---------------+
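
To make the rule concrete, here is the per-row logic I have in mind as plain Python (a hypothetical helper, not a PySpark solution):

from collections import Counter

def pick_most_frequent(words):
    # Hypothetical helper, just to illustrate the rule
    counts = Counter(words)                  # word -> number of occurrences
    top = max(counts.values())               # the highest count in the row
    tied = [w for w in counts if counts[w] == top]
    return "many objects" if len(tied) >= 4 else ",".join(tied)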

I would appreciate any suggestions!

You can try this:

from pyspark.sql import functions as F

# UDF: keep every distinct word whose count equals the row's maximum count
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
    .withColumn("most_frequent", F.udf(lambda x: ', '.join(w for w in set(x) if x.count(w) == max(x.count(i) for i in set(x))))(F.col("list_obj"))) \
    .drop("list_obj")

res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+---------------------+
|object_type       |object_name                                                |most_frequent        |
+------------------+-----------------------------------------------------------+---------------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda            |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars          |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |sirius, mira, sun    |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |moon, kale, titan, io|
+------------------+-----------------------------------------------------------+---------------------+
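
Note that iterating over set(x) makes the order of tied words arbitrary, which is why the output above shows "venus, mars" instead of the expected "mars,venus". If first-appearance order matters, one option is a Counter-based variant of the UDF (a sketch, assuming Python 3.7+ where Counter preserves insertion order):

from collections import Counter

def top_words(x):
    # Hypothetical variant: ties come out in first-appearance order
    counts = Counter(x)
    top = max(counts.values())
    return ', '.join(w for w in counts if counts[w] == top)

res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
    .withColumn("most_frequent", F.udf(top_words)(F.col("list_obj"))) \
    .drop("list_obj")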

Edit:

Following the OP's suggestion, we can get the output they want like this:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Return the tied words as an array, then collapse ties of 4+ words
# into the "many objects" flag
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
    .withColumn("most_frequent", F.udf(lambda x: [w for w in set(x) if x.count(w) == max(x.count(i) for i in set(x))], ArrayType(StringType()))(F.col("list_obj"))) \
    .withColumn("most_frequent", F.when(F.size(F.col("most_frequent")) >= 4, F.lit("many objects")).otherwise(F.concat_ws(", ", F.col("most_frequent")))) \
    .drop("list_obj")

res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+-----------------+
|object_type       |object_name                                                |most_frequent    |
+------------------+-----------------------------------------------------------+-----------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda        |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars      |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |sirius, mira, sun|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |many objects     |
+------------------+-----------------------------------------------------------+-----------------+
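
As a side note, if the Python UDF overhead is a concern, the same per-row logic can also be expressed with Spark's built-in higher-order functions (a sketch, assuming Spark 2.4+; the intermediate columns counts and max_cnt are illustrative names):

from pyspark.sql import functions as F

# Count each distinct word in place, keep the words tied for the maximum
# count, then apply the same four-or-more rule -- no Python UDF involved
native_df = init_df.withColumn("list_obj", F.split("object_name", ",")) \
    .withColumn("counts", F.expr("transform(array_distinct(list_obj), w -> named_struct('word', w, 'cnt', size(filter(list_obj, x -> x = w))))")) \
    .withColumn("max_cnt", F.expr("array_max(transform(counts, c -> c.cnt))")) \
    .withColumn("most_frequent", F.expr("transform(filter(counts, c -> c.cnt = max_cnt), c -> c.word)")) \
    .withColumn("most_frequent", F.when(F.size("most_frequent") >= 4, F.lit("many objects")).otherwise(F.concat_ws(",", "most_frequent"))) \
    .drop("list_obj", "counts", "max_cnt")

Because array_distinct keeps first-occurrence order, the ties should come out as "mars,venus" rather than in arbitrary set order.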

Try this:

from pyspark.sql import functions as psf
from pyspark.sql.window import Window

columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
        ("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"), 
        ("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
        ("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)

# explode the object names into one row per name, then count occurrences
df_exp = init_df.withColumn('object_name_exp', psf.explode(psf.split('object_name',',')))
df_counts = df_exp.groupBy('object_type', 'object_name_exp').count()

window_spec = Window.partitionBy('object_type').orderBy(psf.col('count').desc())
df_ranked = df_counts.withColumn('rank', psf.dense_rank().over(window_spec))

# rank the counts, keeping the top ranked object names
df_top_ranked = df_ranked.filter(psf.col('rank')==psf.lit(1)).drop('count')

# count the number of top ranked object names
df_top_counts = df_top_ranked.groupBy('object_type',  'rank').count()

# join these back to the original object names
df_with_counts = df_top_ranked.join(df_top_counts, on='object_type', how='inner')

# implement the rules whether to retain the reference to the object name or state 'many objects'
df_most_freq = df_with_counts.withColumn('most_frequent'
    , psf.when(psf.col('count')<=psf.lit(3), psf.col('object_name_exp')).otherwise(psf.lit('many objects'))
    )

# collect the retained object names back into an array and de-duplicate them
df_results = df_most_freq.groupBy('object_type').agg(psf.array_distinct(psf.collect_list('most_frequent')).alias('most_frequent'))

# show output                                                     
df_results.show()

+------------------+-------------------+
|       object_type|      most_frequent|
+------------------+-------------------+
|            galaxy|        [andromeda]|
|natural satellites|     [many objects]|
|            planet|      [mars, venus]|
|              star|[sirius, mira, sun]|
+------------------+-------------------+
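
The result here is an array column keyed by object_type; to match the OP's expected layout exactly, one could join it back to init_df and flatten the array (a sketch using the same psf alias):

# Hypothetical final step: join back to the original rows and flatten
# the array into a comma-separated string
df_final = init_df.join(df_results, on='object_type', how='left') \
    .withColumn('most_frequent', psf.concat_ws(',', 'most_frequent'))
df_final.show(truncate=False)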
