

Loop through each row in a grouped spark dataframe and parse to functions

I have a Spark dataframe df that looks like this:

+----+------+------+
|user| value|number|
+----+------+------+
| A  | 25   |    13|
| A  | 6    |    14|
| A  | 2    |    11|
| A  | 32   |    17|
| B  | 22   |    19|
| B  | 42   |    10|
| B  | 43   |    32|
| C  | 33   |    12|
| C  | 90   |    21|
| C  | 12   |    32|
| C  | 22   |    32|
| C  | 64   |    10|
| D  | 32   |    23|
| D  | 62   |    11|
| D  | 32   |    13|
| E  | 63   |    17|
+----+------+------+

I want to group the df per user and then iterate through each row of the user groups, passing them to a couple of functions that I have defined like below:

def first_function(df):
    ...  # operation on df
    return df

def second_function(df):
    ...  # operation on df
    return df

def third_function(df):
    ...  # operation on df
    return df

Based on this answer I'm aware I can extract a list of unique users like so:

from pyspark.sql import functions as F

users = [user[0] for user in df.select("user").distinct().collect()]
users_list = [df.filter(F.col('user')==user) for user in users]

But it is unclear to me how I can use this users_list to iterate through my original df per user group so that I can feed each group to my functions. What is the best way to do this?

You can group the dataframe by user and then use applyInPandas:

df = ...

def functions(pandas_df):
    def first_function(pandas_df1):
        # operation on pandas_df1
        return pandas_df1
    
    def second_function(pandas_df2):
        # operation on pandas_df2
        return pandas_df2

    def third_function(pandas_df3):
        # operation on pandas_df3
        return pandas_df3

    result = first_function(pandas_df)
    result = second_function(result)
    result = third_function(result)
    return result

schema_of_returned_df_of_third_function = "user string, value long, number long"

df.groupBy("user").applyInPandas(functions, schema_of_returned_df_of_third_function).show()

functions will be called by Spark with a Pandas dataframe for each group of the original Spark dataframe. For the given test data the function will be called 5 times, once per user. The parameter pandas_df will contain a Pandas dataframe with all rows for the respective user. A good way to explore this behaviour is to add print(pandas_df) to functions.
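For example, a minimal, self-contained sketch of that exploration could look like the following (assuming Spark 3.0 or later with PyArrow installed; the data is a subset of the table above). In local mode the print output appears in the console, on a cluster it ends up in the executor logs:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [("A", 25, 13), ("A", 6, 14), ("B", 22, 19), ("B", 42, 10), ("C", 33, 12)],
    ["user", "value", "number"],
)

def functions(pandas_df):
    # pandas_df holds all rows of one user group as a Pandas dataframe
    print(pandas_df)
    return pandas_df

df.groupBy("user").applyInPandas(functions, "user string, value long, number long").show()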

Inside functions, you can implement any logic that is required using normal Pandas code. It is possible to add or drop columns and also to change the number of rows of the Pandas dataframe. schema_of_returned_df_of_third_function should describe the structure of the Pandas dataframe returned by functions.
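For instance, if the last function added a new column, the schema string would have to list that column as well. A small sketch (the value_doubled column is just an illustrative name, not part of the question):

def functions(pandas_df):
    pandas_df = pandas_df.copy()
    # any normal Pandas logic is allowed here, e.g. adding a derived column
    pandas_df["value_doubled"] = pandas_df["value"] * 2
    return pandas_df

# the returned dataframe now has four columns, so the schema must describe all four
schema_of_returned_df_of_third_function = "user string, value long, number long, value_doubled long"

df.groupBy("user").applyInPandas(functions, schema_of_returned_df_of_third_function).show()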

A downside of this approach is that each user group has to fit completely into the memory of a single Spark executor, otherwise an OutOfMemory error can occur.
