Apply a UDF to multiple columns and use numpy operations
I have a dataframe named result in pyspark and I want to apply a udf to create a new column as below:
result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)]).withColumnRenamed("_1","count").withColumnRenamed("_2","df").withColumnRenamed("_3","docs")
@udf("float")
def newFunction(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
result=result.withColumn("new_function_result",newFunction_udf(array("count","df","docs")))
The columns count, df, and docs are all integer columns, but this returns:
Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col. Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
When I pass a single column and compute its square, it works fine.
Any help is appreciated.
The error message is misleading, but it is trying to tell you that your function doesn't return a float. Your function returns a value of type numpy.float64, which you can fetch with the VectorUDT type (function newFunctionVector in the example below). Another way to make use of numpy is to cast the numpy type numpy.float64 to the Python type float (function newFunctionWithArray in the example below).
Last but not least, it is not necessary to call array, as udfs can take more than one parameter (function newFunction in the example below).
import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import Vectors, VectorUDT

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], ["count","df","docs"])

# Returns the numpy.float64 unchanged; registered below as vector_udf
def newFunctionVector(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

# Takes one array column; .item() converts numpy.float64 to a Python float
@udf("float")
def newFunctionWithArray(arr):
    returnValue = (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
    return returnValue.item()

# Takes the three columns as separate parameters, so array() is not needed
@udf("float")
def newFunction(count, df, docs):
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()

vector_udf = udf(newFunctionVector, VectorUDT())

result=result.withColumn("new_function_result", newFunction("count","df","docs"))
result=result.withColumn("new_function_result_WithArray", newFunctionWithArray(array("count","df","docs")))
# As written, this column also uses newFunctionWithArray; vector_udf is not applied here
result=result.withColumn("new_function_result_Vector", newFunctionWithArray(array("count","df","docs")))
result.printSchema()
result.show()
Output:
root
|-- count: long (nullable = true)
|-- df: long (nullable = true)
|-- docs: long (nullable = true)
|-- new_function_result: float (nullable = true)
|-- new_function_result_WithArray: float (nullable = true)
|-- new_function_result_Vector: float (nullable = true)
+-----+---+----+-------------------+-----------------------------+--------------------------+
|count| df|docs|new_function_result|new_function_result_WithArray|new_function_result_Vector|
+-----+---+----+-------------------+-----------------------------+--------------------------+
|  138|  5|  10|           4.108459|                     4.108459|                  4.108459|
|  128|  4|  10|           5.362161|                     5.362161|                  5.362161|
|  112|  3|  10|          6.8849173|                    6.8849173|                 6.8849173|
|  120|  3|  10|           6.967983|                     6.967983|                  6.967983|
|  189|  1|  10|          14.372153|                    14.372153|                 14.372153|
+-----+---+----+-------------------+-----------------------------+--------------------------+
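The point about numpy.float64 versus the Python float type can be checked without Spark. The following is a minimal sketch, not part of the original answer: the helper name score is hypothetical, and it only evaluates the same formula with plain numpy for the first row (138, 5, 10) to show what type comes back and how .item() or float() converts it.

```python
import numpy as np


def score(count, df, docs):
    # Hypothetical helper: the same formula as in the UDFs above, without Spark
    return (1 + np.log(count)) * np.log(docs / df)


raw = score(138, 5, 10)

# np.log returns a numpy scalar, so the product is a numpy.float64
print(type(raw).__name__)

# Both .item() and float() produce a built-in Python float
as_item = raw.item()
as_float = float(raw)
print(type(as_item).__name__, type(as_float).__name__)
print(as_item)
```

Either conversion should work inside a udf declared with a "float" return type; the answer above uses .item().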