pandas_udf with pd.Series and other object as arguments
I am having trouble creating a Pandas UDF that performs a calculation on a pd.Series based on a value in the same row of the underlying Spark DataFrame.
However, the most straightforward solution doesn't seem to be supported by the Pandas on Spark API.
A very simple example like the one below:
from pyspark.sql.types import IntegerType
import pyspark.sql.functions as F
import pandas as pd
@F.pandas_udf(IntegerType())
def addition(arr: pd.Series, addition: int) -> pd.Series:
    return arr.add(addition)
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show()
df.withColumn("added", addition(F.col("array"),F.col("addition")))
throws the following exception on the UDF definition line:
NotImplementedError: Unsupported signature: (arr: pandas.core.series.Series, addition: int) -> pandas.core.series.Series.
Am I tackling this problem the wrong way? I could reimplement the whole "addition" function in native PySpark, but the real function I am talking about is terribly complex and would mean an enormous amount of rework.
Loading the example, adding the array import:
import pyspark.sql.types as T
import pyspark.sql.functions as F
import pandas as pd
from array import array
df = spark.createDataFrame([([1,2,3],10),([4,5,6],20)],["array","addition"])
df.show(truncate=False)
print(df.schema.fields)
The output is:
+---------+--------+
|    array|addition|
+---------+--------+
|[1, 2, 3]|      10|
|[4, 5, 6]|      20|
+---------+--------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True)]
If you must use a Pandas function to complete your task, here is one option: call the Pandas function inside a regular PySpark UDF. The Spark DataFrame column passed as arr is an ArrayType, so the UDF converts it into a Pandas Series, applies the Pandas function, and returns the result as an array:
@F.udf(T.ArrayType(T.LongType()))
def addition_pd(arr, addition):
    # arr arrives as a Python list; wrap it in a Pandas Series
    pd_arr = pd.Series(arr)
    added = pd_arr.add(addition)
    # convert back to an array of signed longs for ArrayType(LongType())
    return array("l", added)
df = df.withColumn("added", addition_pd(F.col("array"),F.col("addition")))
df.show(truncate=False)
print(df.schema.fields)
Returns:
+---------+--------+------------+
|array    |addition|added       |
+---------+--------+------------+
|[1, 2, 3]|10      |[11, 12, 13]|
|[4, 5, 6]|20      |[24, 25, 26]|
+---------+--------+------------+
[StructField('array', ArrayType(LongType(), True), True), StructField('addition', LongType(), True), StructField('added', ArrayType(LongType(), True), True)]
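As for why the original definition fails: the type hints of a Series-to-Series pandas_udf have to be pd.Series (or pd.DataFrame), because Spark hands every column to the function as a batch, so a plain int hint is rejected. A minimal sketch of a vectorized variant follows, assuming both columns are annotated as pd.Series and contain no nulls. The names add_per_row and make_addition_udf are made up for illustration, and the Spark wrapper is built lazily so the Pandas logic can be tried without a cluster.

```python
import numpy as np
import pandas as pd


def add_per_row(arr: pd.Series, addition: pd.Series) -> pd.Series:
    """Add each row's scalar to every element of that row's array."""
    # Each element of `arr` is one row's array; `addition` holds the
    # scalar from the same row. Assumption: neither column has nulls.
    return pd.Series([np.asarray(a) + b for a, b in zip(arr, addition)])


def make_addition_udf():
    """Wrap add_per_row as a pandas UDF (requires pyspark and pyarrow)."""
    # Imported here so the Pandas logic above can be tested without Spark.
    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    # pandas_udf reads the pd.Series type hints on add_per_row to infer
    # a Series-to-Series UDF returning an array of longs per row.
    return F.pandas_udf(add_per_row, returnType=T.ArrayType(T.LongType()))
```

With a Spark session available, this could be applied as df.withColumn("added", make_addition_udf()(F.col("array"), F.col("addition"))).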
However, it is worth stating that, when possible, it is recommended to use PySpark functions over PySpark UDFs (see here).
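For the toy addition in this question, a native version might look like the sketch below. It assumes Spark 3.1 or later, where F.transform is available in the Python API; with_added_native and its parameter names are hypothetical, and the real, more complex function may not translate this easily.

```python
def with_added_native(df, array_col="array", scalar_col="addition", out_col="added"):
    """Return df with out_col holding array_col plus the row's scalar_col."""
    # Imported lazily so this helper can be defined without Spark installed.
    import pyspark.sql.functions as F

    # F.transform applies the lambda to every array element inside Spark,
    # so no data is shipped to a Python worker.
    return df.withColumn(
        out_col,
        F.transform(F.col(array_col), lambda x: x + F.col(scalar_col)),
    )
```

Calling with_added_native(df) on the example DataFrame is meant to add the same "added" column as the UDF version.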