How do I count the percentage of zeroes for a specific column in a pyspark dataframe, aggregated by some groupby variable(s)?
I have a pyspark dataframe with the following columns:
source_cd Day Date hour five_min_block five_min_block_volume
Here, Date ranges from January 31, 2020 to March 31, 2021, with a corresponding Day field. source_cd has 5 categories, hour ranges from 0 to 23 within each unique date, and the corresponding five_min_block ranges from 1 to 12. My value column is named five_min_block_volume.
This five_min_block_volume field can hold any value from 0 up to any positive number. What I want to do is compute the percentage of zeroes in that column when it is aggregated by some groupby variable(s) (Date will never be one of these groupby variables).
So, suppose I want to group by source_cd, Day, hour and five_min_block (and perhaps also perform a mean aggregation of five_min_block_volume as an output column). Essentially, my new dataframe would then contain the source_cd, Day, hour and five_min_block fields, and no Date field.
Say that, for a particular combination of source_cd, Day, hour and five_min_block, there are 50 entries in my original dataframe, and 20 of those 50 entries have a five_min_block_volume of 0. Then, for that combination, I want a newly created "percentage of zeroes" column in the grouped dataframe to show 40%. The same goes for all other rows. How do I go about doing this in pyspark?
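For concreteness, here is a tiny runnable sketch of the desired behavior (the data and the percent_zeroes name are made up purely for illustration; plain pandas is used just to show the arithmetic):
import pandas as pd
# one group's worth of entries: 5 rows, 2 of them with zero volume
toy = pd.DataFrame({
    'source_cd': ['A'] * 5,
    'Day': [1] * 5,
    'hour': [0] * 5,
    'five_min_block': [1] * 5,
    'five_min_block_volume': [0, 120, 0, 300, 450],
})
# 2 zeroes out of 5 entries -> the grouped row should report 40%
percent_zeroes = 100 * (toy['five_min_block_volume'] == 0).mean()
print(percent_zeroes)  # 40.0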
May I suggest, for faster responses and more clarity, that it helps to post some code which anyone trying to answer your question can easily copy and paste to generate the sample data you describe? In any case, I've done my best to reproduce your example below.
Note that the solution is just the last few lines of code. Hope it helps.
import pandas as pd
import numpy as np
import pyspark.sql.functions as func
from pyspark.sql import SparkSession
# create the dummy data in pandas then convert to pyspark df
pdf = pd.DataFrame(columns=['source_cd', 'Day', 'Date', 'hour', 'five_min_block', 'five_min_block_volume'])
# create the date range in 5 minute blocks (a shorter span than the question's, to keep the example small)
pdf['Date'] = pd.date_range(start='2020-01-31', end='2020-03-31', freq='5min')
n = pdf.shape[0]
# extract hour and day
pdf['Day'] = pdf['Date'].dt.day
pdf['hour'] = pdf['Date'].dt.hour
pdf['date-temp'] = pdf['Date'].dt.date
# generate the 5 min block labels
pdf['five_min_block'] = 1
pdf['five_min_block'] = pdf.groupby(['date-temp', 'hour'])['five_min_block'].cumsum()
pdf.drop('date-temp', axis=1, inplace=True)
# random source column
pdf['source_cd'] = np.random.randint(low=0, high=5, size=n)
# random volumes, then force roughly 20% of them to zero
# (use .loc on the default RangeIndex rather than chained .iloc assignment,
# which triggers SettingWithCopyWarning and may not write through)
pdf['five_min_block_volume'] = np.random.randint(low=0, high=20000, size=n)
zero_idx = np.random.choice(range(n), size=int(0.2 * n), replace=False)
pdf.loc[zero_idx, 'five_min_block_volume'] = 0
# `spark` is the active SparkSession (create one if running outside a shell/notebook)
spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(pdf)
# Spark resolves column names case-insensitively by default, but match the created casing for clarity
grouping_columns = ['source_cd', 'Day', 'hour', 'five_min_block']
sdf.groupBy(grouping_columns).agg(
    func.mean(func.col('five_min_block_volume')).alias('avg_of_block_volume'),
    # the mean of a 0/1 indicator equals the fraction of rows where the volume is 0
    func.mean((func.col('five_min_block_volume') == 0).cast('float')).alias('percent_blocks_with_0_volume')
).show()
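As a side note, if you also want the raw number of zero-volume rows per group (an extra output I'm assuming might be useful, not something you asked for), summing the same indicator instead of averaging it gives the count; a sketch, with a hypothetical count_blocks_with_0_volume alias:
sdf.groupBy(grouping_columns).agg(
    func.mean((func.col('five_min_block_volume') == 0).cast('float')).alias('percent_blocks_with_0_volume'),
    # sum of the 0/1 indicator = number of zero-volume rows in the group
    func.sum((func.col('five_min_block_volume') == 0).cast('int')).alias('count_blocks_with_0_volume')
).show()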
You can use something like this:
@funcs.pandas_udf('float', funcs.PandasUDFType.GROUPED_AGG)
def percentage_of_zeroes_agg(percentage_of_zeroes_col: pd.Series) -> float:
    return percentage_of_zeroes_col.sum() / percentage_of_zeroes_col.count()
# == Example =============================================================================
# Columns to group dataframe by
groupby_columns = ['Source_cd', 'Day', 'hour']
# Aggregation expression, that computes the rate of zeroes for each group.
aggregation = percentage_of_zeroes_agg(df.percentage_of_zeroes).alias('percentage_of_zeroes')
# Perform the groupby operation
grouped_df = df.groupBy(*groupby_columns).agg(aggregation)
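Note that this assumes df already carries a 0/1 indicator column named percentage_of_zeroes (the full code below creates it). If your raw data only has five_min_block_volume, a minimal sketch for deriving the indicator first would be:
df = df.withColumn(
    'percentage_of_zeroes',
    # 1 when the volume is zero, 0 otherwise
    funcs.when(funcs.col('five_min_block_volume') == 0, 1).otherwise(0)
)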
Here is the complete code, including some helper functions I created to build a sample dataframe based on the column descriptions you provided.
# == Necessary Imports ===================================================================
from __future__ import annotations
import string
import pandas as pd
import numpy as np
import pyspark
from pyspark.sql import functions as funcs
from dateutil.relativedelta import relativedelta
# == Define spark session ================================================================
spark = pyspark.sql.SparkSession.builder.getOrCreate()
# == Helper functions to generate sample dataframe =======================================
# You can ignore these functions, as their purpose is only to create a sample dataframe to
# show how to solve your problem
def get_random_source_cd(n: int, num_cats: int = 5) -> list[str]:
    source_cd_cats = string.ascii_uppercase[:num_cats]
    return list(
        map(source_cd_cats.__getitem__, np.random.randint(0, num_cats, n))
    )
def get_random_hours(n: int) -> list[int]:
    # np.random.randint's upper bound is exclusive; use 24 so hour 23 can appear
    return np.random.randint(0, 24, n).tolist()
def get_random_dates(
    n: int,
    start_date: str | pd.Timestamp,
    end_date: str | pd.Timestamp | None = None,
    days: int | None = None,
) -> list[pd.Timestamp]:
    start_date = pd.to_datetime(start_date)
    if end_date is None:
        if days is None:
            days = n * 2
        end_date = start_date + relativedelta(days=int(days))
    else:
        end_date = pd.to_datetime(end_date)
    # a DatetimeIndex supports positional fancy indexing directly
    possible_dates = pd.date_range(start_date, end_date, freq='d')
    return list(possible_dates[np.random.randint(0, len(possible_dates), n)])
def get_random_five_min_blocks(n: int) -> list[int]:
    # blocks run from 1 to 12, matching the question's description
    return np.random.randint(1, 13, n).tolist()
def generate_random_frame(n: int, **kwargs) -> pyspark.sql.DataFrame:
    dates = get_random_dates(
        n, '2022-06-01', end_date=kwargs.get('end_date', None), days=kwargs.get('days', None)
    )
    days = list(map(lambda date: date.day, dates))
    # make roughly 10% of the volumes exactly zero, so the zero-percentage is meaningful
    volumes = np.where(np.random.random(n) < 0.1, 0.0, np.random.random(n))
    return spark.createDataFrame(
        pd.DataFrame(
            {
                'source_cd': get_random_source_cd(n),
                'Day': days,
                'Date': dates,
                'hour': get_random_hours(n),
                'five_min_block': get_random_five_min_blocks(n),
                'five_min_block_volume': volumes,
            }
        )
    ).withColumn(
        'percentage_of_zeroes',
        funcs.when(funcs.col('five_min_block_volume') == 0, 1).otherwise(0)
    )
# == User defined function used during aggregation =======================================
@funcs.pandas_udf('float', funcs.PandasUDFType.GROUPED_AGG)
def percentage_of_zeroes_agg(percentage_of_zeroes_col: pd.Series) -> float:
    """Pandas user defined function to compute the percentage of zeroes during aggregation.

    Parameters
    ----------
    percentage_of_zeroes_col : pd.Series
        The `percentage_of_zeroes` column, received as a `pandas.Series` holding
        each group's values. You can use this function like so:

        .. code-block:: python

            groupby_columns = ['Source_cd', 'Day']
            aggregation = percentage_of_zeroes_agg(df.percentage_of_zeroes).alias('percentage_of_zeroes')
            grouped_df = df.groupBy(*groupby_columns).agg(aggregation)

        In the above example, the `aggregation` variable shows how you can
        use this function.

    Returns
    -------
    float
        The rate of values in the `percentage_of_zeroes` column equal to 1.

    Notes
    -----
    The `percentage_of_zeroes` column contains the value 1 when the column
    `five_min_block_volume` equals zero, and 0 otherwise. Therefore, when you sum
    all values, you get the total count of rows in a given group that equal 0,
    while `count` returns the number of observations (rows) in each group.
    Dividing the sum by the count gives the ratio of zeroes in a given group.
    """
    return percentage_of_zeroes_col.sum() / percentage_of_zeroes_col.count()
# == Example =============================================================================
# Generate a randomized Spark Dataframe, based on your columns specifications
df = generate_random_frame(50_000, end_date='2023-12-31')
# Columns to group dataframe by
groupby_columns = ['Source_cd', 'Day', 'hour']
# Aggregation expression, that computes the rate of zeroes for each group.
# NOTE: edit the `.alias` parameter, to change the name of the column that stores
# the aggregation results.
aggregation = percentage_of_zeroes_agg(df.percentage_of_zeroes).alias('percentage_of_zeroes')
# The 0/1 indicator also sums to the raw count of zero-volume rows per group;
# this produces the `count_of_zeroes` column shown in the output below.
count_aggregation = funcs.sum(df.percentage_of_zeroes).alias('count_of_zeroes')
# Perform the groupby operation
grouped_df = (
    df
    .groupBy(*groupby_columns)
    .agg(aggregation, count_aggregation)
    # OPTIONAL: uncomment the next line to sort the grouped dataframe
    # by a set of columns (this statement has a heavy impact on performance)
    # .orderBy('count_of_zeroes', ascending=False)
)
# OPTIONAL: create column `pretty_percentage_of_zeroes` to store results from aggregation
# in percentage format.
grouped_df = grouped_df.withColumn(
    'pretty_percentage_of_zeroes',
    funcs.concat(
        funcs.format_number(grouped_df.percentage_of_zeroes * 100, 2).cast('string'),
        funcs.lit('%')
    )
)
grouped_df.show()
# +---------+---+----+--------------------+---------------+---------------------------+
# |Source_cd|Day|hour|percentage_of_zeroes|count_of_zeroes|pretty_percentage_of_zeroes|
# +---------+---+----+--------------------+---------------+---------------------------+
# | A| 1| 0| 0.07692308| 1| 7.69%|
# | A| 1| 1| 0.11764706| 2| 11.76%|
# | A| 1| 2| 0.083333336| 1| 8.33%|
# | A| 1| 3| 0.0| 0| 0.00%|
# | A| 1| 4| 0.13333334| 2| 13.33%|
# | A| 1| 5| 0.0| 0| 0.00%|
# | A| 1| 6| 0.2| 2| 20.00%|
# | A| 1| 7| 0.0| 0| 0.00%|
# | A| 1| 8| 0.1764706| 3| 17.65%|
# | A| 1| 9| 0.10526316| 2| 10.53%|
# | A| 1| 10| 0.0| 0| 0.00%|
# | A| 1| 11| 0.0| 0| 0.00%|
# | A| 1| 12| 0.125| 2| 12.50%|
# | A| 1| 13| 0.05882353| 1| 5.88%|
# | A| 1| 14| 0.055555556| 1| 5.56%|
# | A| 1| 15| 0.0625| 1| 6.25%|
# | A| 1| 16| 0.083333336| 1| 8.33%|
# | A| 1| 17| 0.071428575| 1| 7.14%|
# | A| 1| 18| 0.11111111| 1| 11.11%|
# | A| 1| 19| 0.06666667| 1| 6.67%|
# +---------+---+----+--------------------+---------------+---------------------------+
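Incidentally, since the UDF only divides a sum by a count, the built-in funcs.avg over the same 0/1 indicator produces the identical ratio without the pandas UDF serialization overhead; a sketch under the same column assumptions:
grouped_df = df.groupBy(*groupby_columns).agg(
    # mean of the 0/1 indicator = fraction of zero-volume rows
    funcs.avg('percentage_of_zeroes').alias('percentage_of_zeroes'),
    funcs.sum('percentage_of_zeroes').alias('count_of_zeroes')
)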