
How to use a PySpark UDF in a Scala Spark project?

Several people (1, 2, 3) have discussed using a Scala UDF in a PySpark application, usually for performance reasons. I am interested in the opposite: using a Python UDF in a Scala Spark project.

I am particularly interested in building a model using sklearn (and MLflow), then efficiently applying it to records in a Spark streaming job. I know I could also host the Python model behind a REST API and call that API from the Spark streaming application in mapPartitions, but managing concurrency for that task and setting up the API hosting for the model isn't something I'm excited about.

Is this possible without too much custom development with something like Py4J? Or is this just a bad idea?

Thanks!

Maybe I'm late to the party, but at least I can help with this for posterity. This is actually achievable by creating your Python UDF and registering it with spark.udf.register("my_python_udf", foo). You can view the docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.UDFRegistration.register

This function can then be called from sqlContext in Python, Scala, Java, R, or any language really, because you're accessing sqlContext directly (where the UDF is registered). For example, you would call something like

spark.sql("SELECT my_python_udf(...)").show()

PROS - You get to call your sklearn model from Scala.

CONS - You have to use sqlContext and write SQL-style queries.

I hope this helps, at least for any future visitors.
