In PySpark, how do I loop a filter function over one column of a DataFrame?

Question

This is the data I have:

**name** **movie**
jason        a
jason        b
jason        c
mike         a
mike         b
bruce        a
bruce        c
ryan         b

My goal is to produce this:

**name** **movies**
jason       a,b,c
mike         a,b
bruce        a,c
ryan          b

I am using PySpark and tried to do this with a UDF. I defined the function below, and Spark gave me an error because the function calls the basic DataFrame function 'filter', which I think causes a problem when a new worker starts (correct me if I am wrong).

My logic is to first use a filter to build a subset for each name; the number of rows in that subset is then the number of movies. After that, I create a new column with this UDF.

def udf(user_name):
    return df.filter(df['name'] == user_name).select('movie').dropDuplicates()\
                                    .toPandas()['movie'].tolist()

df.withColumn('movie_number', udf(df['name']))

But it's not working. Is there a way to write a UDF that uses basic Spark functions?

So I turned the name column into a list and looped through the list, but it's super slow. I believe that doing it this way does not use distributed computing.

1) My priority is to figure out how to loop through the values in one column of a PySpark DataFrame with basic functions such as spark_df.filter.

2) Can we first turn the name column into an RDD and then use my UDF to loop through that RDD, so that we take advantage of distributed computing?

3) If I have 2 tables with the same structure (name/movie) but for different years, such as 2005 and 2007, is there an efficient way to build a third table with this structure:

**name** **movie** **in_2005** **in_2007** 
jason        a          1           0
jason        b          0           1
jason        c          1           1
mike         a          0           1
mike         b          1           0
bruce        a          0           0
bruce        c          1           1
ryan         b          1           0

1 and 0 indicate whether or not this person commented on the movie in 2005/2007. In this case the original tables would be:

2005:

**name** **movie**
jason        a
jason        c
mike         b
bruce        c
ryan         b

2007:

**name** **movie**
jason        b
jason        c
mike         a
bruce        c

My idea is to concatenate the 2 tables with an added 'year' column, and use a pivot table to get the desired structure.

Answer 1

I suggest using groupby followed by collect_list instead of converting the whole DataFrame to an RDD. You can apply a UDF afterwards.

import pandas as pd
import pyspark.sql.functions as func

# toy example dataframe
ls = [
    ['jason', 'movie_1'],
    ['jason', 'movie_2'],
    ['jason', 'movie_3'],
    ['mike', 'movie_1'],
    ['mike', 'movie_2'],
    ['bruce', 'movie_1'],
    ['bruce', 'movie_3'],
    ['ryan', 'movie_2']
]
df = spark.createDataFrame(pd.DataFrame(ls, columns=['name', 'movie']))

# alias the aggregated column so it can be referenced as 'movies' later
df_movie = df.groupby('name').agg(func.collect_list(func.col('movie')).alias('movies'))

Now, here is an example of creating a UDF to work with the new column movies. I simply show how to calculate the length of the list in each row.

from pyspark.sql.types import IntegerType

def movie_len(movies):
    return len(movies)
udf_movie_len = func.udf(movie_len, returnType=IntegerType())

df_movie.select('name', 'movies', udf_movie_len(func.col('movies')).alias('n_movies')).show()

This will give:

+-----+--------------------+--------+
| name|              movies|n_movies|
+-----+--------------------+--------+
|jason|[movie_1, movie_2...|       3|
| ryan|           [movie_2]|       1|
|bruce|  [movie_1, movie_3]|       2|
| mike|  [movie_1, movie_2]|       2|
+-----+--------------------+--------+

Answer 1: score 0, accepted, 2017-05-02 00:33:49
