PySpark: using a window function
I have a dataframe in which each row represents one rating of a particular movie by a user. Each movie can be rated in multiple categories by multiple users.
This is the resulting dataframe, which I created from the movie_lens data.
|movie_id|year|categories|
+--------+----+----------+
| 122|1990| Comedy|
| 122|1990| Romance|
| 185|1990| Action|
| 185|1990| Crime|
| 185|1990| Thriller|
| 231|1990| Comedy|
| 292|1990| Action|
| 292|1990| Drama|
| 292|1990| Sci-Fi|
| 292|1990| Thriller|
| 316|1990| Action|
| 316|1990| Adventure|
| 316|1990| Sci-Fi|
| 329|1990| Action|
| 329|1990| Adventure|
| 329|1990| Drama|
.
.
.
movie_id is the unique id of the movie, year is the year in which a user rated the movie, and categories is one of the movie's 12 categories. Partial file here
I want to find the most-rated movies in each category for each decade (that is, count how often each movie appears in each decade and category),
something like:
+-----------------------------------+
| year | category | movie_id | rank |
+-----------------------------------+
| 1990 | Comedy | 1273 | 1 |
| 1990 | Comedy | 6547 | 2 |
| 1990 | Comedy | 8973 | 3 |
.
.
| 1990 | Comedy | 7483 | 10 |
.
.
| 1990 | Drama | 1273 | 1 |
| 1990 | Drama | 6547 | 2 |
| 1990 | Drama | 8973 | 3 |
.
.
| 1990 | Drama | 7483 | 10 |
.
.
| 2000 | Comedy | 1273 | 1 |
| 2000 | Comedy | 6547 | 2 |
.
.
That is, for every decade, the top 10 movies in each category.
I understand that a pyspark window function needs to be used. This is what I tried:
windowSpec = Window.partitionBy(res_agg['year']).orderBy(res_agg['categories'].desc())
final = res_agg.select(res_agg['year'], res_agg['movie_id'], res_agg['categories']).withColumn('rank', func.rank().over(windowSpec))
but it returns something like the following:
+----+--------+------------------+----+
|year|movie_id| categories|rank|
+----+--------+------------------+----+
|2000| 8606|(no genres listed)| 1|
|2000| 1587| Action| 1|
|2000| 1518| Action| 1|
|2000| 2582| Action| 1|
|2000| 5460| Action| 1|
|2000| 27611| Action| 1|
|2000| 48304| Action| 1|
|2000| 54995| Action| 1|
|2000| 4629| Action| 1|
|2000| 26606| Action| 1|
|2000| 56775| Action| 1|
|2000| 62008| Action| 1|
I am pretty new to pyspark and am stuck here. Can anyone tell me what I am doing wrong?
You're right that you need a window, but first you need an aggregation to compute the frequencies.
First, let's compute the decade.
from pyspark.sql.functions import col, concat, lit, substring
df_decade = df.withColumn("decade", concat(substring(col("year"), 1, 3), lit("0")))
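To sanity-check the decade logic outside Spark, here is a minimal plain-Python sketch. The helper names are hypothetical and not part of the answer. It mirrors the string approach (first three characters of the year plus "0") and compares it with integer arithmetic. In Spark, the arithmetic form could presumably be written as floor(col("year") / 10) * 10, which does not rely on the year having exactly four digits.

```python
def decade_from_string(year):
    """Mirror the string approach: first three characters of the year plus "0"."""
    return str(year)[:3] + "0"


def decade_from_arithmetic(year):
    """Round the year down to the nearest multiple of 10."""
    return (int(year) // 10) * 10


for y in (1990, 1995, 1999, 2000, 2008):
    # Both approaches agree for four-digit years.
    assert decade_from_string(y) == str(decade_from_arithmetic(y))

print(decade_from_string(1995), decade_from_arithmetic(2008))
```

Note that the string approach yields a string column, so the decade would sort and compare as text rather than as a number.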
Then we compute the frequency by decade, category and movie_id:
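from pyspark.sql.functions import col, count
# the question's column is named "categories"; it is aliased to "category" here
agg_df = df_decade\
    .groupBy("decade", col("categories").alias("category"), "movie_id")\
    .agg(count("*").alias("freq"))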
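As a rough illustration of what this aggregation produces, the following plain-Python sketch counts rows per (decade, category, movie_id) key with collections.Counter. The sample rows are made up for the example and are not taken from the question's data.

```python
from collections import Counter

# (movie_id, decade, category) rows: one row per rating instance (made-up sample).
ratings = [
    (122, "1990", "Comedy"),
    (122, "1990", "Comedy"),
    (231, "1990", "Comedy"),
    (122, "1990", "Romance"),
    (185, "1990", "Action"),
    (185, "2000", "Action"),
]

# Analogue of grouping by decade, category and movie_id and counting rows:
# each distinct key maps to the number of rows that share it.
freq = Counter((decade, category, movie_id) for movie_id, decade, category in ratings)

for key, n in sorted(freq.items()):
    print(key, n)
```

Each key appears once in the result, which is why the ranking step can then order by the count instead of by the category name.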
And finally, we define a window partitioned by decade and category and select the top 10 using the rank function:
from pyspark.sql import Window
from pyspark.sql.functions import col, desc, rank
w = Window.partitionBy("decade", "category").orderBy(desc("freq"))
top10 = agg_df.withColumn("r", rank().over(w)).where(col("r") <= 10)
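To make the ranking semantics concrete, here is a small plain-Python sketch of ranking within each (decade, category) partition by descending frequency and keeping ranks up to n. The function name and the sample frequencies are hypothetical. It imitates how rank() treats ties: tied rows share a rank and the following rank skips ahead.

```python
from collections import defaultdict


def top_n_by_rank(freq, n):
    """freq maps (decade, category, movie_id) -> count; returns rows with rank <= n."""
    partitions = defaultdict(list)
    for (decade, category, movie_id), count in freq.items():
        partitions[(decade, category)].append((movie_id, count))

    result = []
    for (decade, category), rows in sorted(partitions.items()):
        rows.sort(key=lambda r: r[1], reverse=True)  # highest frequency first
        rank, prev_count = 0, None
        for position, (movie_id, count) in enumerate(rows, start=1):
            if count != prev_count:
                # rank() semantics: ties share a rank, the next rank skips ahead.
                rank, prev_count = position, count
            if rank <= n:
                result.append((decade, category, movie_id, count, rank))
    return result


# Made-up frequencies; movies 231 and 344 are tied within 1990 / Comedy.
freq = {
    ("1990", "Comedy", 122): 5,
    ("1990", "Comedy", 231): 3,
    ("1990", "Comedy", 344): 3,
    ("1990", "Comedy", 500): 1,
    ("1990", "Drama", 292): 2,
}
for row in top_n_by_rank(freq, 2):
    print(row)
```

Because of ties, filtering on rank can return more than 10 rows per partition. If exactly 10 rows are needed, row_number() is the usual alternative, at the cost of arbitrary tie-breaking, while dense_rank() keeps ties without leaving gaps in the ranks.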