
pyspark using window function

I have a dataframe that contains rows which represent an instance of a rating for a particular movie by a user. Each movie can be rated in multiple categories by multiple users. This is the resultant dataframe, which I created using the movie_lens data.

|movie_id|year|categories|
+--------+----+----------+
|     122|1990|    Comedy|
|     122|1990|   Romance|
|     185|1990|    Action|
|     185|1990|     Crime|
|     185|1990|  Thriller|
|     231|1990|    Comedy|
|     292|1990|    Action|
|     292|1990|     Drama|
|     292|1990|    Sci-Fi|
|     292|1990|  Thriller|
|     316|1990|    Action|
|     316|1990| Adventure|
|     316|1990|    Sci-Fi|
|     329|1990|    Action|
|     329|1990| Adventure|
|     329|1990|     Drama|
.
.
.

movie_id is the unique id of the movie, year is the year in which a user rated the movie, and categories is one of the 12 categories of the movie. Partial file here.

I want to find the most-rated movies in each decade in each category (counting the frequency of each movie in each decade in each category),

something like

+-----------------------------------+
| year | category | movie_id | rank |
+-----------------------------------+
| 1990 | Comedy   | 1273     | 1    |
| 1990 | Comedy   | 6547     | 2    |
| 1990 | Comedy   | 8973     | 3    |
.
.
| 1990 | Comedy   | 7483     | 10   |
.
.
| 1990 | Drama    | 1273     | 1    |
| 1990 | Drama    | 6547     | 2    |
| 1990 | Drama    | 8973     | 3    |
.
.
| 1990 | Drama    | 7483     | 10   |
.
.
| 2000 | Comedy   | 1273     | 1    |
| 2000 | Comedy   | 6547     | 2    |
.
.

For every decade, the top 10 movies in each category.

I understand that the pyspark window function needs to be used. This is what I tried:

windowSpec = Window.partitionBy(res_agg['year']).orderBy(res_agg['categories'].desc())

final = res_agg.select(res_agg['year'], res_agg['movie_id'], res_agg['categories']).withColumn('rank', func.rank().over(windowSpec))

but it returns something like the below:

+----+--------+------------------+----+
|year|movie_id|        categories|rank|
+----+--------+------------------+----+
|2000|    8606|(no genres listed)|   1|
|2000|    1587|            Action|   1|
|2000|    1518|            Action|   1|
|2000|    2582|            Action|   1|
|2000|    5460|            Action|   1|
|2000|   27611|            Action|   1|
|2000|   48304|            Action|   1|
|2000|   54995|            Action|   1|
|2000|    4629|            Action|   1|
|2000|   26606|            Action|   1|
|2000|   56775|            Action|   1|
|2000|   62008|            Action|   1|

I am pretty new to pyspark and am stuck here. Can anyone guide me on what I am doing wrong?

You're right, you need to use a window, but first you need to perform an aggregation to compute the frequencies.

First, let's compute the decade.

from pyspark.sql.functions import col, concat, substring, lit
df_decade = df.withColumn("decade", concat(substring(col("year"), 0, 3), lit("0")))
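
As a side note (not part of the original answer), the decade can also be derived arithmetically when year is numeric; a minimal alternative sketch:

# Hypothetical alternative: integer-divide the year by 10 and scale back up,
# which yields a numeric decade (e.g. 1994 -> 1990) instead of a string
df_decade = df.withColumn("decade", (col("year") / 10).cast("int") * 10)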

Then we compute the frequency by decade, category and movie_id:

from pyspark.sql.functions import count

agg_df = df_decade\
      .groupBy("decade", "categories", "movie_id")\
      .agg(count(col("*")).alias("freq"))

And finally, we define a window partitioned by decade and category, and select the top 10 using the rank function:

from pyspark.sql import Window
from pyspark.sql.functions import desc, rank
w = Window.partitionBy("decade", "categories").orderBy(desc("freq"))
top10 = agg_df.withColumn("r", rank().over(w)).where(col("r") <= 10)
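
Putting the three steps together, here is a minimal end-to-end sketch, assuming the column names movie_id, year and categories from the question's dataframe and a local SparkSession; the few sample rows are only illustrative:

from pyspark.sql import SparkSession, Window
from pyspark.sql.functions import col, concat, substring, lit, count, desc, rank

spark = SparkSession.builder.getOrCreate()

# Illustrative sample with the question's schema: movie_id, year, categories
df = spark.createDataFrame(
    [(122, 1990, "Comedy"), (122, 1990, "Romance"), (185, 1990, "Action")],
    ["movie_id", "year", "categories"])

# 1. Derive the decade from the year, e.g. 1994 -> "1990"
df_decade = df.withColumn("decade", concat(substring(col("year"), 0, 3), lit("0")))

# 2. Count how often each movie was rated per decade and category
agg_df = df_decade.groupBy("decade", "categories", "movie_id") \
                  .agg(count("*").alias("freq"))

# 3. Rank movies by frequency within each (decade, category) and keep the top 10
w = Window.partitionBy("decade", "categories").orderBy(desc("freq"))
top10 = (agg_df.withColumn("rank", rank().over(w))
               .where(col("rank") <= 10)
               .orderBy("decade", "categories", "rank"))

top10.show()

Note that rank() keeps ties, so a group can contain more than 10 rows when several movies share the same frequency; row_number() (or dense_rank()) can be swapped in if exactly 10 rows per decade and category are required.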
