Improving a Pandas UDF in PySpark

Question

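I need to perform an aggregation over a sliding window in PySpark. Specifically, I have to do the following:

Consider 100 days' worth of data at a time
Group by a given ID column
Take the last value of each group
Sum the values and return the result

These steps must be computed over a sliding window defined with .rangeBetween(-100 days, 0).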

I can easily achieve this by writing a Pandas UDF that takes some columns of the PySpark DataFrame as input, turns them into a Pandas DataFrame, computes the aggregation, and returns the scalar result. The UDF is then applied over the desired sliding window.

Although this solution works correctly, it takes a long time (3-4 hours) to finish because the DataFrame contains millions of rows. Is there a way to improve the computation time of this operation? I am using PySpark on Databricks.

My Pandas UDF is:

@pandas_udf(FloatType(), PandasUDFType.GROUPED_AGG)
def method2(analyst: pd.Series, revisions: pd.Series) -> float:
  df = pd.DataFrame({
    'analyst': analyst,
    'revisions': revisions
  })
  return df.groupby('analyst').last()['revisions'].sum() / df.groupby('analyst').last()['revisions'].abs().sum()

and it is applied as follows:

days = lambda x: x*60*60*24
w = Window.partitionBy('csecid').orderBy(F.col('date').cast('timestamp').cast('long')).rangeBetween(-days(100), 0)
df = df.withColumn('new_col', method2(F.col('analystid'), F.col('revisions_improved')).over(w))
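For readers who want to check the arithmetic, here is a minimal plain-Python sketch of what the UDF above computes for a single window: the last revision per analyst, then the sum divided by the sum of absolute values. The function name last_value_ratio and the sample inputs are hypothetical, and the sketch does not try to replicate how pandas treats missing values.

```python
from typing import Dict, Sequence


def last_value_ratio(analysts: Sequence[int], revisions: Sequence[float]) -> float:
    """Keep the last revision seen per analyst, then return sum / sum of absolute values."""
    last_by_analyst: Dict[int, float] = {}
    for analyst, revision in zip(analysts, revisions):
        # Later rows overwrite earlier ones, so the dict ends up holding the last value.
        last_by_analyst[analyst] = revision
    values = list(last_by_analyst.values())
    denominator = sum(abs(v) for v in values)
    if denominator == 0:
        # Assumption: return NaN for a 0 / 0 window instead of raising.
        return float("nan")
    return sum(values) / denominator


# One window with two analysts: the last values are -1 (analyst 10) and -3 (analyst 20).
print(last_value_ratio([10, 10, 10, 20], [5, 10, -1, -3]))  # -1.0
```

The inputs are assumed to be ordered the same way as the window, so that "last" means the most recent row.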

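EDIT: I know that this kind of aggregation could be done with numpy arrays, and that PySpark UDFs are much faster when working with numpy structures. However, I want to avoid that solution because I need to apply, within the same framework, functions that are far more complex than the one shown and that are hard to reproduce with numpy.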

Answer 1

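I recently had to implement a similar aggregation, and my first attempt also used a Pandas UDF with sliding windows. The performance was very poor, and I managed to improve it with the following approach.

Try using collect_list to assemble the sliding-window vectors, then map your UDF over them. Note that this only works if each sliding window fits in a worker's memory (it usually does).

Here is my test code. The first part is just your code, turned into a complete, reproducible example.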

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql import Window
from pyspark.sql.functions import pandas_udf, PandasUDFType, udf
from pyspark.sql.types import FloatType, StructType, StructField, IntegerType, StringType

df = spark.createDataFrame(
  [(1, "2021-04-01", 10, -30),
   (1, "2021-03-01", 10, 20),
   (1, "2021-02-01", 10, -1),
   (1, "2021-01-01", 10, 10),
   (1, "2020-12-01", 10, 5),
   (1, "2021-04-01", 20, -5),
   (1, "2021-03-01", 20, -4),
   (1, "2021-02-01", 20, -3),
   (2, "2021-03-01", 10, 5),
   (2, "2021-02-01", 10, 6),
  ], 
  StructType([
    StructField("csecid", IntegerType(), True),  # the sample rows use integer ids
    StructField("date", StringType(), True), 
    StructField("analystid", IntegerType(), True), 
    StructField("revisions_improved", IntegerType(), True)
  ]))

### Baseline
@pandas_udf(FloatType(), PandasUDFType.GROUPED_AGG)
def method2(analyst: pd.Series, revisions: pd.Series) -> float:
  df = pd.DataFrame({
    'analyst': analyst,
    'revisions': revisions
  })
  return df.groupby('analyst').last()['revisions'].sum() / df.groupby('analyst').last()['revisions'].abs().sum()

days = lambda x: x*60*60*24
w = Window.partitionBy('csecid').orderBy(F.col('date').cast('timestamp').cast('long')).rangeBetween(-days(100), 0)

# df.withColumn('new_col', method2(F.col('analystid'), F.col('revisions_improved')).over(w))

Suggested alternative:

### Method 3
from typing import List

@udf(FloatType())
def method3(analyst: List[int], revisions: List[int]) -> float:
  df = pd.DataFrame({
    'analyst': analyst,
    'revisions': revisions
  })
  return float(df.groupby('analyst').last()['revisions'].sum() / df.groupby('analyst').last()['revisions'].abs().sum())

(df
.withColumn('new_col', method2(F.col('analystid'), F.col('revisions_improved')).over(w))

.withColumn('analyst_win', F.collect_list("analystid").over(w))
.withColumn('revisions_win', F.collect_list("revisions_improved").over(w))

.withColumn('method3', method3(F.collect_list("analystid").over(w), 
                               F.collect_list("revisions_improved").over(w)))
.orderBy("csecid", "date", "analystid")
.show(truncate=False))

Result:

+------+----------+---------+------------------+---------+----------------------------+-----------------------------+---------+
|csecid|date      |analystid|revisions_improved|new_col  |analyst_win                 |revisions_win                |method3  |
+------+----------+---------+------------------+---------+----------------------------+-----------------------------+---------+
|1     |2020-12-01|10       |5                 |1.0      |[10]                        |[5]                          |1.0      |
|1     |2021-01-01|10       |10                |1.0      |[10, 10]                    |[5, 10]                      |1.0      |
|1     |2021-02-01|10       |-1                |-1.0     |[10, 10, 10, 20]            |[5, 10, -1, -3]              |-1.0     |
|1     |2021-02-01|20       |-3                |-1.0     |[10, 10, 10, 20]            |[5, 10, -1, -3]              |-1.0     |
|1     |2021-03-01|10       |20                |0.6666667|[10, 10, 10, 20, 10, 20]    |[5, 10, -1, -3, 20, -4]      |0.6666667|
|1     |2021-03-01|20       |-4                |0.6666667|[10, 10, 10, 20, 10, 20]    |[5, 10, -1, -3, 20, -4]      |0.6666667|
|1     |2021-04-01|10       |-30               |-1.0     |[10, 10, 20, 10, 20, 10, 20]|[10, -1, -3, 20, -4, -30, -5]|-1.0     |
|1     |2021-04-01|20       |-5                |-1.0     |[10, 10, 20, 10, 20, 10, 20]|[10, -1, -3, 20, -4, -30, -5]|-1.0     |
|2     |2021-02-01|10       |6                 |1.0      |[10]                        |[6]                          |1.0      |
|2     |2021-03-01|10       |5                 |1.0      |[10, 10]                    |[6, 5]                       |1.0      |
+------+----------+---------+------------------+---------+----------------------------+-----------------------------+---------+

analyst_win and revisions_win are only there to show how the sliding windows are built and passed to the UDF. They should be removed in production.
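To make the window contents concrete, here is a small plain-Python emulation of what collect_list over a 100-day range window gathers for each row. It is only an illustrative sketch, not how Spark evaluates the window: the helper collect_windows and the Row alias are hypothetical, ties on the same date are ordered by analystid (Spark gives no such guarantee), and the quadratic scan is only suitable for tiny inputs.

```python
from datetime import date, timedelta
from typing import List, Tuple

# (csecid, date, analystid, revisions_improved)
Row = Tuple[int, date, int, int]


def collect_windows(rows: List[Row], days: int = 100) -> List[Tuple[Row, List[int], List[int]]]:
    """For each row, list the analyst ids and revisions in [date - days, date] for the same csecid."""
    ordered = sorted(rows, key=lambda r: (r[0], r[1], r[2]))
    result = []
    for current in ordered:
        start = current[1] - timedelta(days=days)
        # A range frame includes every row whose date falls in the interval,
        # including other rows that share the current row's date.
        in_window = [r for r in ordered if r[0] == current[0] and start <= r[1] <= current[1]]
        result.append((current, [r[2] for r in in_window], [r[3] for r in in_window]))
    return result


rows = [
    (1, date(2020, 12, 1), 10, 5),
    (1, date(2021, 1, 1), 10, 10),
    (1, date(2021, 2, 1), 10, -1),
    (1, date(2021, 2, 1), 20, -3),
]
for row, analyst_win, revisions_win in collect_windows(rows):
    print(row[1], row[2], analyst_win, revisions_win)
```

Each pair of lists is what the UDF receives as its two arguments for that row, which is why every window has to fit in a worker's memory.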

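Moving the Pandas groupby out of the UDF would probably improve performance further, since Spark can perform that step itself. I did not touch that part, however, because you mentioned that the function is not representative of the actual task.

Check the performance in the Spark UI, in particular the time statistics of the tasks that apply the UDF. If they take long, try using repartition to increase the number of partitions so that each task processes a smaller subset of the data.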
