
How to use Scala UDF accepting Map[String, String] in PySpark

Based on the discussion from How to use Scala UDF in PySpark?, I am able to execute a Scala UDF from PySpark for primitive types, but now I want to call a Scala UDF that accepts a Map[String, String].

package com.test

import org.apache.spark.sql.functions.udf

object ScalaPySparkUDFs extends Serializable {
    def testFunction1(x: Int): Int = { x * 2 }
    def testFunction2(x: Map[String, String]): String = ??? // use the Map key and value pairs
    def testUDFFunction1 = udf { x: Int => testFunction1(x) }
    def testUDFFunction2 = udf { x: Map[String, String] => testFunction2(x) }
}

testUDFFunction1 works fine:

from pyspark.sql.column import Column, _to_java_column, _to_seq

# `col` here is the input Column to pass to the Scala UDF
_f = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction1()
Column(_f.apply(_to_seq(sc, [col], _to_java_column)))
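
For reference, a minimal sketch of wrapping this pattern in a reusable helper (the helper name test_udf1 and a DataFrame df with an integer column value are assumptions, not part of the original code):

from pyspark.sql.column import Column, _to_java_column, _to_seq
from pyspark.sql.functions import col

# Hypothetical wrapper so the Scala UDF reads like a normal PySpark function
def test_udf1(column):
    _f = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction1()
    return Column(_f.apply(_to_seq(sc, [column], _to_java_column)))

df = df.withColumn("doubled", test_udf1(col("value")))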

But I am not sure how to execute testUDFFunction2 from PySpark:

_f2 = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction2() 
Column(_f2.apply(_to_seq(sc, [lit("KEY"), col("FIRSTCOLUMN"), lit("KEY2"), col("SECONDCOLUMN")], _to_java_column)))

This fails with the following exception:

Py4JJavaError: An error occurred while calling o430.apply.
: java.lang.ClassCastException: sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction2$$Lambda$3693/1231805146 cannot be cast to scala.Function4
    at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:241)
    at org.apache.spark.sql.expressions.SparkUserDefinedFunction.createScalaUDF(UserDefinedFunction.scala:113)
    at org.apache.spark.sql.expressions.SparkUserDefinedFunction.apply(UserDefinedFunction.scala:101)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
    at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
    at java.lang.Thread.run(Thread.java:748)

I can easily do this from Scala:

import org.apache.spark.sql.functions.{col, lit, map}

val output = input.withColumn("result", testUDFFunction2(map(
      lit("KEY1"), col("FIRSTCOLUMN"),
      lit("KEY2"), col("SECONDCOLUMN")
    )))

But I want to convert that code to PySpark, and I am not able to find good documentation. As mentioned in https://spark.apache.org/docs/3.2.1/api/java/org/apache/spark/sql/expressions/UserDefinedFunction.html, there are only two apply methods, and they accept lists of Column arguments. Any recommendations on how I can proceed?

I tried the create_map function in pyspark.sql.functions, but I could not get it to work with Column types.

I can see the problem in how you are calling the function. Because you pass four separate columns, Spark tries to treat the UDF as a function of four arguments (scala.Function4), while testUDFFunction2 is a function of a single Map argument, which is what causes the ClassCastException.

You need to change the following lines:

_f2 = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction2()
Column(_f2.apply(_to_seq(sc, [lit("KEY"), col("FIRSTCOLUMN"), lit("KEY2"), col("SECONDCOLUMN")], _to_java_column)))

Since the function can be called using the map function in Scala, there is an equivalent function, create_map, in PySpark. The only thing you need to do is:

from pyspark.sql.column import Column, _to_java_column, _to_seq
from pyspark.sql.functions import create_map, col, lit

_f2 = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction2()
Column(_f2.apply(_to_seq(sc, [create_map(lit("KEY"), col("FIRSTCOLUMN"), lit("KEY2"), col("SECONDCOLUMN"))], _to_java_column)))

That way, you will be able to call the function and avoid the ClassCastException.
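
Putting it all together, a minimal sketch of a reusable wrapper (the helper name test_udf2 and the DataFrame input_df are assumptions; the com.test JAR is assumed to be on the Spark classpath):

from pyspark.sql.column import Column, _to_java_column, _to_seq
from pyspark.sql.functions import create_map, col, lit

# Hypothetical helper: packs alternating key/value columns into a single
# MapType column via create_map, then passes that one column to the Scala UDF,
# so the Scala side sees a Function1 over Map[String, String]
def test_udf2(*key_value_cols):
    _f2 = sc._jvm.com.test.ScalaPySparkUDFs.testUDFFunction2()
    map_col = create_map(*key_value_cols)
    return Column(_f2.apply(_to_seq(sc, [map_col], _to_java_column)))

output = input_df.withColumn("result", test_udf2(
    lit("KEY1"), col("FIRSTCOLUMN"),
    lit("KEY2"), col("SECONDCOLUMN"),
))

This mirrors the Scala withColumn example from the question, with create_map taking the place of Scala's map.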
