
pandas_udf with pd.Series and other object as arguments

I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame.

However, the most straightforward solution doesn't seem to be supported by the Pandas on Spark API:

A very simple example like the one below

from pyspark.sql.types import IntegerType

import pyspark.sql.functions as F
import pandas as pd

@F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
  return arr.add(addition)

df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()

df.withColumn("added", addition(F.col("array"),F.col("addition")))

throws the following exception on the UDF definition line:

NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.

Am I tackling this problem in the wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and would mean an enormous amount of rework.
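
As a side note on the error itself: the type-hint form of pandas_udf only accepts signatures in which every argument is typed as pd.Series (or one of the iterator/DataFrame variants), so a plain int parameter is rejected at definition time. Below is a minimal sketch of a corrected signature; the ArrayType(LongType()) return type, the renamed add parameter, and the row-wise list comprehension are illustrative choices, not the only way to write it.

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

# Both parameters are typed as pd.Series, which the type-hint API accepts.
# Each element of arr holds one array cell; each element of add one scalar cell.
@F.pandas_udf(T.ArrayType(T.LongType()))
def addition(arr: pd.Series, add: pd.Series) -> pd.Series:
    # Row-wise: add the scalar to every element of the corresponding array
    return pd.Series([(pd.Series(a) + b).tolist() for a, b in zip(arr, add)])

df.withColumn("added", addition(F.col("array"), F.col("addition"))).show()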

Loading the example and adding import array:

import pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array

df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)

The response is,

+---------+--------+
|    array|addition|
+---------+--------+
|[1, 2, 3]|      10|
|[4, 5, 6]|      20|
+---------+--------+

[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]

If you must use a Pandas function to complete your task, here is an option for a solution that uses a Pandas function within a PySpark UDF:

  • The Spark DF arr column is ArrayType; convert it into a Pandas Series
  • Apply the Pandas function
  • Then, convert the Pandas Series back to an array
@F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
    # The ArrayType cell arrives as a Python list; wrap it in a Pandas Series
    pd_arr = pd.Series(arr)
    # Apply the Pandas operation
    added = pd_arr.add(addition)
    # Convert back to an array of longs to match the declared return type
    return array("l", added)

df = df.withColumn("added", addition_pd(F.col("array"),F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)

Returns

+---------+--------+------------+
|array    |addition|added       |
+---------+--------+------------+
|[1, 2, 3]|10      |[11, 12, 13]|
|[4, 5, 6]|20      |[24, 25, 26]|
+---------+--------+------------+

[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]

However, it is worth noting that, when possible, it is recommended to use built-in PySpark functions rather than a PySpark UDF (see here).
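
For this particular example, the native route is short: F.transform (available through the Python lambda API since Spark 3.1) applies an expression to every element of an array column, and the lambda may reference other columns of the same row. A minimal sketch, assuming Spark 3.1+:

import pyspark.sql.functions as F

# Add the row's "addition" value to every element of the "array" column,
# entirely with built-in functions (no UDF, no Python serialization overhead)
df = df.withColumn("added", F.transform(F.col("array"), lambda x: x + F.col("addition")))
df.show(truncate=False)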
