Find the k most frequent words in each row from a PySpark dataframe

I have a Spark dataframe that looks like this:

columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
        ("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"), 
        ("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
        ("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)
init_df.show(truncate=False)

+------------------+-----------------------------------------------------------+
|object_type       |object_name                                                |
+------------------+-----------------------------------------------------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |
+------------------+-----------------------------------------------------------+

I need to use PySpark to create a new column containing the most frequent word(s) from the object_name column.
Conditions:

  • If a row has a single dominant word (mode = 1), select that word as the most frequent (like "andromeda" in the first row).
  • If a row has two dominant words occurring the same number of times (mode = 2), select both words (like "mars" and "venus" in the second row: they each occur 3 times, while the rest of the words are less frequent).
  • If a row has three dominant words occurring equally often, select all three (like "mira", "sun", and "sirius", which each occur twice while the rest occur only once).
  • If a row has four or more dominant words occurring the same number of times (as in the fourth row), set the "many objects" flag instead.

Expected output:

+------------------+-----------------------------------------------------------+---------------+
|object_type       |object_name                                                |most_frequent  |
+------------------+-----------------------------------------------------------+---------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda      |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|mars,venus     |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |mira,sun,sirius|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |many objects   |
+------------------+-----------------------------------------------------------+---------------+
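
To make the rule concrete, here is the per-row logic I have in mind as plain Python (a hypothetical helper, not a PySpark solution):

from collections import Counter

def pick_most_frequent(words):
    # Hypothetical helper, just to illustrate the rule
    counts = Counter(words)                  # word -> number of occurrences
    top = max(counts.values())               # the highest count in the row
    tied = [w for w in counts if counts[w] == top]
    return "many objects" if len(tied) >= 4 else ",".join(tied)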

I would appreciate any suggestions!

You can try this:

from pyspark.sql import functions as F

# UDF: keep every distinct word whose count equals the row's maximum count
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
    .withColumn("most_frequent", F.udf(lambda x: ', '.join(w for w in set(x) if x.count(w) == max(x.count(i) for i in set(x))))(F.col("list_obj"))) \
    .drop("list_obj")

res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+---------------------+
|object_type       |object_name                                                |most_frequent        |
+------------------+-----------------------------------------------------------+---------------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda            |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars          |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |sirius, mira, sun    |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |moon, kale, titan, io|
+------------------+-----------------------------------------------------------+---------------------+
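
Note that iterating over set(x) makes the order of tied words arbitrary, which is why the output above shows "venus, mars" instead of the expected "mars,venus". If first-appearance order matters, one option is a Counter-based variant of the UDF (a sketch, assuming Python 3.7+ where Counter preserves insertion order):

from collections import Counter

def top_words(x):
    # Hypothetical variant: ties come out in first-appearance order
    counts = Counter(x)
    top = max(counts.values())
    return ', '.join(w for w in counts if counts[w] == top)

res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
    .withColumn("most_frequent", F.udf(top_words)(F.col("list_obj"))) \
    .drop("list_obj")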

Edit:

Following the OP's suggestion, we can get the output they want like this:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType

# Return the tied words as an array, then collapse ties of 4+ words
# into the "many objects" flag
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"), ",")) \
    .withColumn("most_frequent", F.udf(lambda x: [w for w in set(x) if x.count(w) == max(x.count(i) for i in set(x))], ArrayType(StringType()))(F.col("list_obj"))) \
    .withColumn("most_frequent", F.when(F.size(F.col("most_frequent")) >= 4, F.lit("many objects")).otherwise(F.concat_ws(", ", F.col("most_frequent")))) \
    .drop("list_obj")

res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+-----------------+
|object_type       |object_name                                                |most_frequent    |
+------------------+-----------------------------------------------------------+-----------------+
|galaxy            |andromeda,milky way,condor,andromeda                       |andromeda        |
|planet            |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars      |
|star              |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran     |sirius, mira, sun|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa  |many objects     |
+------------------+-----------------------------------------------------------+-----------------+
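
As a side note, if the Python UDF overhead is a concern, the same per-row logic can also be expressed with Spark's built-in higher-order functions (a sketch, assuming Spark 2.4+; the intermediate columns counts and max_cnt are illustrative names):

from pyspark.sql import functions as F

# Count each distinct word in place, keep the words tied for the maximum
# count, then apply the same four-or-more rule -- no Python UDF involved
native_df = init_df.withColumn("list_obj", F.split("object_name", ",")) \
    .withColumn("counts", F.expr("transform(array_distinct(list_obj), w -> named_struct('word', w, 'cnt', size(filter(list_obj, x -> x = w))))")) \
    .withColumn("max_cnt", F.expr("array_max(transform(counts, c -> c.cnt))")) \
    .withColumn("most_frequent", F.expr("transform(filter(counts, c -> c.cnt = max_cnt), c -> c.word)")) \
    .withColumn("most_frequent", F.when(F.size("most_frequent") >= 4, F.lit("many objects")).otherwise(F.concat_ws(",", "most_frequent"))) \
    .drop("list_obj", "counts", "max_cnt")

Because array_distinct keeps first-occurrence order, the ties should come out as "mars,venus" rather than in arbitrary set order.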

Try this:

from pyspark.sql import functions as psf
from pyspark.sql.window import Window

columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
        ("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"), 
        ("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
        ("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)

# explode the object names into one row per name, then count occurrences
df_exp = init_df.withColumn('object_name_exp', psf.explode(psf.split('object_name',',')))
df_counts = df_exp.groupBy('object_type', 'object_name_exp').count()

window_spec = Window.partitionBy('object_type').orderBy(psf.col('count').desc())
df_ranked = df_counts.withColumn('rank', psf.dense_rank().over(window_spec))

# rank the counts, keeping the top ranked object names
df_top_ranked = df_ranked.filter(psf.col('rank')==psf.lit(1)).drop('count')

# count the number of top ranked object names
df_top_counts = df_top_ranked.groupBy('object_type',  'rank').count()

# join these back to the original object names
df_with_counts = df_top_ranked.join(df_top_counts, on='object_type', how='inner')

# implement the rules whether to retain the reference to the object name or state 'many objects'
df_most_freq = df_with_counts.withColumn('most_frequent'
    , psf.when(psf.col('count')<=psf.lit(3), psf.col('object_name_exp')).otherwise(psf.lit('many objects'))
    )

# collect the retained object names back into an array and de-duplicate them
df_results = df_most_freq.groupBy('object_type').agg(psf.array_distinct(psf.collect_list('most_frequent')).alias('most_frequent'))

# show output                                                     
df_results.show()

+------------------+-------------------+
|       object_type|      most_frequent|
+------------------+-------------------+
|            galaxy|        [andromeda]|
|natural satellites|     [many objects]|
|            planet|      [mars, venus]|
|              star|[sirius, mira, sun]|
+------------------+-------------------+
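
The result here is an array column keyed by object_type; to match the OP's expected layout exactly, one could join it back to init_df and flatten the array (a sketch using the same psf alias):

# Hypothetical final step: join back to the original rows and flatten
# the array into a comma-separated string
df_final = init_df.join(df_results, on='object_type', how='left') \
    .withColumn('most_frequent', psf.concat_ws(',', 'most_frequent'))
df_final.show(truncate=False)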
