
Apply udf to multiple columns and use numpy operations

I have a dataframe named result in pyspark and I want to apply a udf to create a new column as below:

result = (sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)])
          .withColumnRenamed("_1", "count")
          .withColumnRenamed("_2", "df")
          .withColumnRenamed("_3", "docs"))

@udf("float")
def newFunction(arr):
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

result = result.withColumn("new_function_result", newFunction_udf(array("count","df","docs")))

The columns count, df and docs are all integer columns, but this returns:

Py4JError: An error occurred while calling z:org.apache.spark.sql.functions.col.
Trace:
py4j.Py4JException: Method col([class java.util.ArrayList]) does not exist
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:339)
    at py4j.Gateway.invoke(Gateway.java:274)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)

When I try passing a single column and computing its square, it works fine.
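For reference, a minimal sketch of the single-column case that works; the squaring UDF and the new column name here are assumptions based on the description above, not code from the original post:

@udf("float")
def square(x):
    # a plain Python float is returned, so it matches the declared "float" return type
    return float(x * x)

result = result.withColumn("count_squared", square("count"))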

Any help is appreciated.

The error message is misleading, but it is trying to tell you that your function doesn't return a float. Your function returns a value of type numpy.float64, which you can fetch with the VectorUDT type (function newFunctionVector in the example below). Another way to make use of numpy is to cast the numpy type numpy.float64 to the Python type float (function newFunctionWithArray in the example below).

Last but not least, it is not necessary to call array, since UDFs can take more than one parameter (function newFunction in the example below).

import numpy as np
from pyspark.sql.functions import udf, array
from pyspark.sql.types import FloatType
from pyspark.mllib.linalg import Vectors, VectorUDT

result = sqlContext.createDataFrame([(138,5,10), (128,4,10), (112,3,10), (120,3,10), (189,1,10)], ["count","df","docs"])

def newFunctionVector(arr):
    # returns a numpy.float64; registered below via udf(..., VectorUDT())
    return (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])

@udf("float")
def newFunctionWithArray(arr):
    returnValue = (1 + np.log(arr[0])) * np.log(arr[2]/arr[1])
    # .item() casts the numpy.float64 to a plain Python float
    return returnValue.item()

@udf("float")
def newFunction(count, df, docs):
    # the udf takes the three columns directly, so array() is not needed
    returnValue = (1 + np.log(count)) * np.log(docs/df)
    return returnValue.item()


vector_udf = udf(newFunctionVector, VectorUDT())

result=result.withColumn("new_function_result", newFunction("count","df","docs"))

result=result.withColumn("new_function_result_WithArray", newFunctionWithArray(array("count","df","docs")))

result=result.withColumn("new_function_result_Vector", newFunctionWithArray(array("count","df","docs")))

result.printSchema()

result.show()

Output:

root 
|-- count: long (nullable = true) 
|-- df: long (nullable = true) 
|-- docs: long (nullable = true) 
|-- new_function_result: float (nullable = true) 
|-- new_function_result_WithArray: float (nullable = true) 
|-- new_function_result_Vector: float (nullable = true)

+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|count| df|docs|new_function_result|new_function_result_WithArray|new_function_result_Vector|
+-----+---+----+-------------------+-----------------------------+--------------------------+ 
|  138|  5|  10|           4.108459|                     4.108459|                  4.108459| 
|  128|  4|  10|           5.362161|                     5.362161|                  5.362161|
|  112|  3|  10|          6.8849173|                    6.8849173|                 6.8849173|
|  120|  3|  10|           6.967983|                     6.967983|                  6.967983|
|  189|  1|  10|          14.372153|                    14.372153|                 14.372153|  
+-----+---+----+-------------------+-----------------------------+--------------------------+
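As a quick sanity check, the first row can be reproduced by hand from the values shown above (plain numpy arithmetic, not code from the answer):

import numpy as np

# first row: count=138, df=5, docs=10
print((1 + np.log(138)) * np.log(10 / 5))  # ~4.108459, matching new_function_result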
