
Register UDF to SqlContext from Scala to use in PySpark

Is it possible to register a UDF (or function) written in Scala to use in PySpark? E.g.:

val mytable = sc.parallelize(1 to 2).toDF("spam")
mytable.registerTempTable("mytable")
def addOne(m: Integer): Integer = m + 1
// Spam: 1, 2

In Scala, the following is now possible:

val UDFaddOne = sqlContext.udf.register("UDFaddOne", addOne _)
val mybiggertable = mytable.withColumn("moreSpam", UDFaddOne(mytable("spam")))
// Spam: 1, 2
// moreSpam: 2, 3

I would like to use "UDFaddOne" in PySpark like this:

%pyspark

mytable = sqlContext.table("mytable")
UDFaddOne = sqlContext.udf("UDFaddOne") # does not work
mybiggertable = mytable.withColumn("+1", UDFaddOne(mytable("spam"))) # does not work

Background: We are a team of developers, some coding in Scala and some in Python, and we would like to share functions that have already been written. It would also be possible to save them into a library and import that.

As far as I know, PySpark doesn't provide any equivalent of the callUDF function, and because of that it is not possible to access a registered UDF directly.

The simplest solution here is to use a raw SQL expression:

from pyspark.sql.functions import expr

mytable.withColumn("moreSpam", expr("UDFaddOne({})".format("spam")))

## OR
sqlContext.sql("SELECT *, UDFaddOne(spam) AS moreSpam FROM mytable")

## OR
mytable.selectExpr("*", "UDFaddOne(spam) AS moreSpam")
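For example, assuming the Scala side has already registered UDFaddOne against the same shared SqlContext (as in the question above), the expr variant could be used end to end like this; the expected output below is based on the sample data from the question:

from pyspark.sql.functions import expr

mytable = sqlContext.table("mytable")
mytable.withColumn("moreSpam", expr("UDFaddOne(spam)")).show()
# +----+--------+
# |spam|moreSpam|
# +----+--------+
# |   1|       2|
# |   2|       3|
# +----+--------+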

This approach is rather limited, so if you need to support more complex workflows you should build a package and provide complete Python wrappers. You'll find an example UDAF wrapper in my answer to Spark: How to map Python with Scala or Java User Defined Functions?
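As a minimal sketch of what such a wrapper could look like (the module name udfs.py and the function add_one are hypothetical, and this assumes UDFaddOne was registered on the shared SqlContext from the Scala side):

# udfs.py -- hypothetical shared wrapper module
from pyspark.sql.functions import expr

def add_one(col_name):
    # Delegates to the Scala-registered UDF via a raw SQL expression;
    # col_name is the name of the column passed to UDFaddOne.
    return expr("UDFaddOne({})".format(col_name))

Python users could then simply call mytable.withColumn("moreSpam", add_one("spam")) without needing to know that a Scala UDF is involved.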

The following worked for me (basically a summary of multiple places, including the link provided by zero323):

In Scala:

package com.example
import org.apache.spark.sql.functions.udf

object udfObj extends Serializable {
  // Returns a Spark UDF that adds one to its integer input
  def createUDF = {
    udf((x: Int) => x + 1)
  }
}

In Python (assume sc is the SparkContext; if you are using Spark 2.0, you can get it from the SparkSession):

from py4j.java_gateway import java_import
from pyspark.sql.column import Column, _to_java_column

jvm = sc._gateway.jvm
java_import(jvm, "com.example.*")

def udf_f(col):
    # createUDF() returns the Scala UserDefinedFunction; calling its
    # apply method with a Java Column yields a new Java Column, which
    # we wrap back into a Python Column.
    return Column(jvm.com.example.udfObj.createUDF().apply(_to_java_column(col)))
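With the jar on the classpath, the wrapper can then be used like any other column expression, for example (column access syntax adapted for PySpark):

mytable = sqlContext.table("mytable")
mybiggertable = mytable.withColumn("moreSpam", udf_f(mytable["spam"]))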

And of course, make sure the jar created in Scala is added using --jars and --driver-class-path.

So what happens here:

We create a function inside a serializable object which returns the UDF in Scala (I am not 100% sure Serializable is required; it was required for me for more complex UDFs, so it could be because they needed to pass Java objects).

In Python we access the internal JVM gateway (sc._gateway.jvm is a private member, so it could change in the future, but I see no way around it) and import our package using java_import. We then access the createUDF function and call it. This creates an object which has an apply method (functions in Scala are actually Java objects with an apply method). The input to the apply method is a Java Column, which is why the wrapper first converts the Python Column with _to_java_column. The result of applying it to the column is a new Java Column, so we need to wrap it with the Python Column class to make it available to withColumn.
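For comparison, this is essentially the same pattern PySpark's own column functions follow internally: call a JVM-side function, then wrap the returned Java Column. A simplified sketch (not the actual PySpark source) using the built-in upper function:

from pyspark.sql.column import Column, _to_java_column

def upper_sketch(col):
    # Invoke org.apache.spark.sql.functions.upper on the JVM side
    # and wrap the returned Java Column in a Python Column.
    jc = sc._jvm.org.apache.spark.sql.functions.upper(_to_java_column(col))
    return Column(jc)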
