Loop through each row of a grouped Spark dataframe and pass it to functions

Question

I have a Spark dataframe df that looks like this:

+----+------+------+
|user| value|number|
+----+------+------+
| A  | 25   |    13|
| A  | 6    |    14|
| A  | 2    |    11|
| A  | 32   |    17|
| B  | 22   |    19|
| B  | 42   |    10|
| B  | 43   |    32|
| C  | 33   |    12|
| C  | 90   |    21|
| C  | 12   |    32|
| C  | 22   |    32|
| C  | 64   |    10|
| D  | 32   |    23|
| D  | 62   |    11|
| D  | 32   |    13|
| E  | 63   |    17|
+----+------+------+

I want to group df by user and then iterate over each row within each user group, passing the rows to several functions I have defined, like this:

def first_function(df):
    ...  # operation on df
    return df

def second_function(df):
    ...  # operation on df
    return df

def third_function(df):
    ...  # operation on df
    return df

Based on this answer, I know I can extract a list of unique users like this:

from pyspark.sql import functions as F

users = [user[0] for user in df.select("user").distinct().collect()]
users_list = [df.filter(F.col('user')==user) for user in users]

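But it is not clear to me how to use this users_list to iterate over the original df for each user group so that I can feed the groups to my functions. What is the best way to do this?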

Answer 1

You can group the dataframe by user and then use applyInPandas:

df = ...

def functions(pandas_df):
    def first_function(pandas_df1):
        # operation on pandas_df1
        return pandas_df1
    
    def second_function(pandas_df2):
        # operation on pandas_df2
        return pandas_df2

    def third_function(pandas_df3):
        # operation on pandas_df3
        return pandas_df3

    result = first_function(pandas_df)
    result = second_function(result)
    result = third_function(result)
    return result

schema_of_returned_df_of_third_function = "user string, value long, number long"

df.groupBy("user").applyInPandas(functions, schema_of_returned_df_of_third_function).show()

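Spark will call functions with a Pandas dataframe for each group of the original Spark dataframe. For the given test data, functions will be called 5 times, once per user. The parameter pandas_df will contain a Pandas dataframe with all rows for the respective user. A good way to explore this is to add print(pandas_df) to functions.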
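To get a feel for what functions receives without starting a Spark session, the per-group calls can be emulated with plain pandas. This is only a sketch: the groupby loop stands in for what applyInPandas does, sample is a hand-built copy of the test data, and seen_groups is a hypothetical helper list used to count the calls.

```python
import pandas as pd

# Hand-built copy of the test data from the question (assumption: plain pandas, no Spark).
sample = pd.DataFrame({
    "user":   ["A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C", "C", "D", "D", "D", "E"],
    "value":  [25, 6, 2, 32, 22, 42, 43, 33, 90, 12, 22, 64, 32, 62, 32, 63],
    "number": [13, 14, 11, 17, 19, 10, 32, 12, 21, 32, 32, 10, 23, 11, 13, 17],
})

seen_groups = []  # hypothetical helper: records (user, row count) for every call


def functions(pandas_df):
    # Each call receives all rows of exactly one user.
    seen_groups.append((pandas_df["user"].iloc[0], len(pandas_df)))
    print(pandas_df)
    return pandas_df


# Emulate the per-group calls; in Spark, applyInPandas performs this step.
results = [functions(group) for _, group in sample.groupby("user")]
combined = pd.concat(results, ignore_index=True)
```

Running this prints one small dataframe per user, which mirrors what print(pandas_df) would show inside the Spark job (there the output ends up in the executor logs rather than the console when running on a cluster).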

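Inside functions you can implement any logic you need using normal Pandas code. Columns can be added or removed, and the number of rows of the Pandas dataframe can change. schema_of_returned_df_of_third_function should describe the structure of the Pandas dataframe that functions returns.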
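As an illustration of how the schema follows the returned dataframe, here is a minimal sketch of a functions body that adds a column, drops a column, and reduces the row count. The column value_share, the name schema_of_returned_df, and the chosen operations are all assumptions made for this example and not part of the original answer.

```python
import pandas as pd


def functions(pandas_df):
    # Add a column: each row's share of this user's total value.
    result = pandas_df.assign(value_share=pandas_df["value"] / pandas_df["value"].sum())
    # Change the row count: keep only the row with the largest value for this user.
    result = result.nlargest(1, "value")
    # Remove a column that is no longer needed.
    return result.drop(columns=["number"])


# The schema describes what functions returns, not what it receives.
schema_of_returned_df = "user string, value long, value_share double"

# Local check on one user's rows (user B from the test data), without Spark.
group_b = pd.DataFrame({"user": ["B", "B", "B"], "value": [22, 42, 43], "number": [19, 10, 32]})
out = functions(group_b)
```

With Spark, the same function and schema string would be passed as df.groupBy("user").applyInPandas(functions, schema_of_returned_df). The column names and types in the schema string have to line up with the returned Pandas dataframe.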

One drawback of this approach is that each user's group must fit entirely into the memory of one of the Spark executors; otherwise OutOfMemory errors can occur.

Solution 1
1 2021-05-15 13:28:57
