Find the k most frequent words in each row from PySpark dataframe
I have a Spark dataframe that looks something like this:
columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"),
("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)
init_df.show(truncate = False)
+------------------+-----------------------------------------------------------+
|object_type |object_name |
+------------------+-----------------------------------------------------------+
|galaxy |andromeda,milky way,condor,andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |
+------------------+-----------------------------------------------------------+
I need to create a new column with the most frequent words from the object_name column using PySpark.
Conditions:
- if there is one dominant word in the row (mode = 1), select that word as the most frequent (like "andromeda" in the first row)
- if there are two dominant words in the row that occur an equal number of times (mode = 2), select both words (like "mars" and "venus" in the second row: they occur 3 times each, while the rest of the words are less common)
- if there are three dominant words in the row that occur an equal number of times, select all three (like "mira", "sun" and "sirius", which occur 2 times each, while the rest of the words occur only once)
- if there are four or more dominant words in the row that occur an equal number of times (like in the fourth row), set the "many objects" flag.
Expected output:
+-----------------+-----------------------------------------------------------+---------------+
|object_type |object_name |most_frequent |
+-----------------+-----------------------------------------------------------+---------------+
|galaxy |andromeda,milky way,condor,andromeda |andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|mars,venus |
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |mira,sun,sirius|
|natural satellite|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |many objects |
+-----------------+-----------------------------------------------------------+---------------+
I'll be very grateful for any advice!
You can try this:
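from pyspark.sql import functions as F

# split into a list, then keep every word whose count equals the row maximum
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"),",")) \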
.withColumn("most_frequent", F.udf(lambda x: ', '.join([mitem[1] for mitem in zip((x.count(item) for item in set(x)),set(x)) if mitem[0] == max((x.count(item) for item in set(x)))]))(F.col("list_obj"))) \
.drop("list_obj")
res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+---------------------+
|object_type |object_name |most_frequent |
+------------------+-----------------------------------------------------------+---------------------+
|galaxy |andromeda,milky way,condor,andromeda |andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars |
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |sirius, mira, sun |
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |moon, kale, titan, io|
+------------------+-----------------------------------------------------------+---------------------+
EDIT:
Following OP's suggestion, we can get their desired output by doing something like this:
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
res_df = init_df.withColumn("list_obj", F.split(F.col("object_name"),",")) \
.withColumn("most_frequent", F.udf(lambda x: [mitem[1] for mitem in zip((x.count(item) for item in set(x)),set(x)) if mitem[0] == max((x.count(item) for item in set(x)))], ArrayType(StringType()))(F.col("list_obj"))) \
.withColumn("most_frequent", F.when(F.size(F.col("most_frequent")) >= 4, F.lit("many objects")).otherwise(F.concat_ws(", ", F.col("most_frequent")))) \
.drop("list_obj")
res_df.show(truncate=False)
+------------------+-----------------------------------------------------------+-----------------+
|object_type |object_name |most_frequent |
+------------------+-----------------------------------------------------------+-----------------+
|galaxy |andromeda,milky way,condor,andromeda |andromeda |
|planet |mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth|venus, mars |
|star |mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran |sirius, mira, sun|
|natural satellites|moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa |many objects |
+------------------+-----------------------------------------------------------+-----------------+
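The lambda in the UDF above calls x.count once per distinct word and rebuilds the set several times, which makes it hard to follow. Below is a minimal sketch of the same rule in plain Python using collections.Counter. The function name most_frequent_words and the max_words parameter are my own choices, not something from the question. It keeps tied words in order of first appearance, so the result should line up with the expected output in the question instead of depending on set order.

```python
from collections import Counter


def most_frequent_words(object_name, max_words=3, sep=","):
    """Return the tied most frequent words, or 'many objects' if more than max_words tie."""
    words = object_name.split(sep)
    counts = Counter(words)      # word -> number of occurrences
    top = max(counts.values())   # highest occurrence count in the row
    # Counter preserves insertion order, so ties stay in first-seen order
    winners = [word for word, n in counts.items() if n == top]
    if len(winners) > max_words:
        return "many objects"
    return sep.join(winners)


print(most_frequent_words("mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"))
```

Assuming pyspark.sql.functions is imported as F, this could probably be wrapped as F.udf(most_frequent_words, StringType()) and applied to F.col("object_name") directly, without the intermediate list_obj column. Null values in object_name are not handled here and would need a guard.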
Try this:
from pyspark.sql import functions as psf
from pyspark.sql.window import Window
columns = ["object_type", "object_name"]
data = [("galaxy", "andromeda,milky way,condor,andromeda"),
("planet", "mars,jupiter,venus,mars,saturn,venus,earth,mars,venus,earth"),
("star", "mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran"),
("natural satellites", "moon,io,io,elara,moon,kale,titan,kale,phobos,titan,europa")]
init_df = spark.createDataFrame(data).toDF(*columns)
# unpivot the object name and count
df_exp = init_df.withColumn('object_name_exp', psf.explode(psf.split('object_name',',')))
df_counts = df_exp.groupBy('object_type', 'object_name_exp').count()
# rank the counts within each object type, keeping the top ranked object names
window_spec = Window.partitionBy('object_type').orderBy(psf.col('count').desc())
df_ranked = df_counts.withColumn('rank', psf.dense_rank().over(window_spec))
df_top_ranked = df_ranked.filter(psf.col('rank')==psf.lit(1)).drop('count')
# count the number of top ranked object names
df_top_counts = df_top_ranked.groupBy('object_type', 'rank').count()
# join these back to the original object names
df_with_counts = df_top_ranked.join(df_top_counts, on='object_type', how='inner')
# implement the rules whether to retain the reference to the object name or state 'many objects'
df_most_freq = df_with_counts.withColumn('most_frequent'
, psf.when(psf.col('count')<=psf.lit(3), psf.col('object_name_exp')).otherwise(psf.lit('many objects'))
)
# collect the retained object names back into an array and de-duplicate them
df_results = df_most_freq.groupBy('object_type').agg(psf.array_distinct(psf.collect_list('most_frequent')).alias('most_frequent'))
# show output
df_results.show()
+------------------+-------------------+
| object_type| most_frequent|
+------------------+-------------------+
| galaxy| [andromeda]|
|natural satellites| [many objects]|
| planet| [mars, venus]|
| star|[sirius, mira, sun]|
+------------------+-------------------+
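For anyone unfamiliar with dense_rank: it gives tied counts the same rank with no gaps, which is why filtering on rank == 1 keeps every word tied for first place. Here is a plain-Python illustration for a single row. It is not Spark code, and the helper name dense_rank_desc is hypothetical.

```python
from collections import Counter


def dense_rank_desc(counts):
    """Map each key to its dense rank when ordering by count, descending."""
    # distinct counts, largest first: rank 1 is the highest count, no gaps after ties
    distinct = sorted(set(counts.values()), reverse=True)
    rank_of = {value: rank for rank, value in enumerate(distinct, start=1)}
    return {key: rank_of[value] for key, value in counts.items()}


counts = Counter("mira,sun,altair,sun,sirius,rigel,mira,sirius,aldebaran".split(","))
ranks = dense_rank_desc(counts)
top = [word for word, rank in ranks.items() if rank == 1]
print(top)  # ['mira', 'sun', 'sirius']
```

Note that df_results only has object_type and most_frequent. If the object_name column is needed as in the expected output, joining df_results back to init_df on object_type should restore it.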