Loop through each row in a grouped Spark dataframe and pass to functions
I have a Spark dataframe df that looks like this:
+----+------+------+
|user| value|number|
+----+------+------+
|   A|    25|    13|
|   A|     6|    14|
|   A|     2|    11|
|   A|    32|    17|
|   B|    22|    19|
|   B|    42|    10|
|   B|    43|    32|
|   C|    33|    12|
|   C|    90|    21|
|   C|    12|    32|
|   C|    22|    32|
|   C|    64|    10|
|   D|    32|    23|
|   D|    62|    11|
|   D|    32|    13|
|   E|    63|    17|
+----+------+------+
I want to group df per user and then iterate through each user group, passing it to a couple of functions that I have defined like below:
def first_function(df):
    ...  # operation on df
    return df

def second_function(df):
    ...  # operation on df
    return df

def third_function(df):
    ...  # operation on df
    return df
Based on this answer I'm aware I can extract a list of unique users like so:
from pyspark.sql import functions as F
users = [user[0] for user in df.select("user").distinct().collect()]
users_list = [df.filter(F.col('user')==user) for user in users]
But it is unclear to me how I can use this users_list to iterate through my original df per user group so that I can feed the groups to my functions. What is the best way to do this?
You can group the dataframe by user and then use applyInPandas:
df = ...

def functions(pandas_df):
    def first_function(pandas_df1):
        ...  # operation on pandas_df1
        return pandas_df1

    def second_function(pandas_df2):
        ...  # operation on pandas_df2
        return pandas_df2

    def third_function(pandas_df3):
        ...  # operation on pandas_df3
        return pandas_df3

    result = first_function(pandas_df)
    result = second_function(result)
    result = third_function(result)
    return result

schema_of_returned_df_of_third_function = "user string, value long, number long"
df.groupBy("user").applyInPandas(functions, schema_of_returned_df_of_third_function).show()
functions will be called by Spark with a Pandas dataframe for each group of the original Spark dataframe. For the given testdata the function will be called 5 times, once per user. The parameter pandas_df will contain a Pandas dataframe with all rows for the respective user. A good way to explore what each call receives is to add print(pandas_df) to functions.
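Since Spark hands functions one plain Pandas dataframe per group, you can prototype the per-group logic locally without a Spark cluster: an ordinary pandas groupby produces the same per-user splits. This sketch (not part of the original answer) rebuilds the question's testdata to show what each call would receive:

```python
import pandas as pd

# Rebuild the testdata from the question as a pandas dataframe.
df = pd.DataFrame({
    "user":   ["A", "A", "A", "A", "B", "B", "B",
               "C", "C", "C", "C", "C", "D", "D", "D", "E"],
    "value":  [25, 6, 2, 32, 22, 42, 43, 33, 90, 12, 22, 32, 32, 62, 32, 63],
    "number": [13, 14, 11, 17, 19, 10, 32, 12, 21, 32, 32, 10, 23, 11, 13, 17],
})

# groupby yields (key, sub-dataframe) pairs -- the same splits that
# applyInPandas would pass to `functions`, one call per user.
groups = {user: pandas_df for user, pandas_df in df.groupby("user")}
print(len(groups))        # 5 groups, one per user
print(len(groups["A"]))   # 4 rows for user A
```

Each value in groups is exactly the kind of Pandas dataframe that functions will see for that user.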
Inside of functions, you can implement any logic that is required using normal Pandas code. It is possible to add or drop columns and also to alter the number of rows of the Pandas dataframe.
schema_of_returned_df_of_third_function should describe the structure of the Pandas dataframe returned by functions.
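For example, if functions adds a computed column, the schema string must list that column as well. The sketch below uses a hypothetical column name (total); the schema shown is what you would then pass to applyInPandas:

```python
import pandas as pd

def functions(pandas_df):
    # Hypothetical per-group logic: derive a new column from existing ones.
    result = pandas_df.copy()
    result["total"] = result["value"] + result["number"]
    return result

# The schema must now mention the added column, in column order:
schema = "user string, value long, number long, total long"

# Check the returned structure locally on a small pandas sample.
sample = pd.DataFrame({"user": ["A", "A"], "value": [25, 6], "number": [13, 14]})
print(functions(sample))
```

If the schema string and the returned dataframe's columns disagree, Spark will raise an error when applyInPandas materializes the result.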
A downside of this approach is that each group has to fit completely into the memory of one of the Spark executors to prevent an OutOfMemory error.