
Index with groupby PySpark

I'm trying to translate the pandas code below to PySpark, but I'm having trouble with these two points:

  • Is there an index in a Spark DataFrame?
  • How can I group on level=0 like that?

I didn't find anything helpful in the documentation. If you have a hint, I'll be really grateful!

df.set_index('var1', inplace=True)                    # var1 becomes the index
df['varGrouped'] = df.groupby(level=0)['var2'].min()  # per-index min of var2, aligned back to every row
df.reset_index(inplace=True)

Is there an index in a Spark DataFrame?

The pandas index doesn't exist in Spark, since Spark is not designed for row-level manipulation.
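If you do need a per-row identifier, one common workaround (not part of the original answer) is monotonically_increasing_id(), though it only guarantees unique, increasing values, not the stable positional index that pandas provides:

import pyspark.sql.functions as func

# assumption: data_sdf is an existing Spark DataFrame
# adds a unique (but not consecutive) 64-bit id per row
data_sdf.withColumn('row_id', func.monotonically_increasing_id())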

How can I group on level=0 like that?

Instead of grouping by level, you group directly by the columns that identify the granularity level.

pandas_df.groupby(level=0) would group pandas_df by the first index field (in the case of multi-index data). Since there is only one index field in the provided code, your code is a simple group-by on the var1 field. The same can be replicated in PySpark with a groupBy() and taking the min of var2.
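For reference, a minimal sketch of that plain groupBy() aggregation (assuming a DataFrame data_sdf with columns var1 and var2, matching the question); note that it collapses the result to one row per group:

import pyspark.sql.functions as func

# one row per distinct var1 -- the row count shrinks to the number of groups
data_sdf. \
    groupBy('var1'). \
    agg(func.min('var2').alias('varGrouped'))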

However, in your pandas code the aggregation result is stored in a new column within the same dataframe, so the number of rows doesn't decrease. This can be replicated using the min window function.

import pyspark.sql.functions as func
from pyspark.sql.window import Window as wd

# min of var2 within each var1 partition, attached to every row
data_sdf. \
    withColumn('grouped_var', func.min('var2').over(wd.partitionBy('var1')))

withColumn helps you add or replace columns.


Here's an example using sample data.
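To make the example reproducible, here's one way to build the sample data_sdf (the SparkSession setup is an assumption, not part of the original answer):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# sample data matching the output shown below
data_sdf = spark.createDataFrame([(1, 2), (1, 3), (2, 5), (2, 4)], ['a', 'b'])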

data_sdf.show()

# +---+---+
# |  a|  b|
# +---+---+
# |  1|  2|
# |  1|  3|
# |  2|  5|
# |  2|  4|
# +---+---+

data_sdf. \
    withColumn('grouped_res', func.min('b').over(wd.partitionBy('a'))). \
    show()

# +---+---+-----------+
# |  a|  b|grouped_res|
# +---+---+-----------+
# |  1|  2|          2|
# |  1|  3|          2|
# |  2|  5|          4|
# |  2|  4|          4|
# +---+---+-----------+
