
Referencing a Spark (Scala) library from a PySpark library

I will be building a Python library for PySpark clients. This library will be calling a Spark (Scala) library that I already built and have in production. As motivation (and, perhaps, a sanity check), the Python library they will be calling would look something like this:

from pyspark.sql import DataFrame as PyDataFrame

def process(python_data_frame):
    # Grab the SparkContext and SQLContext backing the incoming DataFrame.
    sc = python_data_frame.rdd.context
    sql_context = python_data_frame.sql_ctx

    # Call the Scala library through the JVM gateway, passing the underlying Java DataFrame.
    processed_scala_df = sc._jvm.com.mayonesa.ScalaClass.process(python_data_frame._jdf)

    # Wrap the resulting Java DataFrame back into a PySpark DataFrame.
    return PyDataFrame(processed_scala_df, sql_context)

I would like to make importing/using this library as painless as possible for my PySpark customers. How would I reference my Scala project as a dependency to/within this Python library? I would like to avoid them having to add options (i.e., --jars) to the spark-submit command.

--jars or --packages is the typical way to go with 3rd-party libraries (like yours).
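
For example, a client invocation might look something like the following (the jar path and the Maven coordinates com.mayonesa:scala-lib:1.0.0 are placeholders, not your actual artifact):

spark-submit --jars /path/to/scala-lib.jar client_job.py

or, if the Scala artifact is published to a Maven repository:

spark-submit --packages com.mayonesa:scala-lib:1.0.0 client_job.py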

If you want to make their experience of using your library less painful, you might want to wrap the spark-submit command, with all the extra parameters, in a wrapper script, which would definitely make it much easier to call.
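
A minimal sketch of such a wrapper in Python, assuming the Scala jar is bundled inside the Python package (the jars/scala-lib.jar location is a placeholder; adjust it to wherever your build actually places the jar):

import os
import subprocess
import sys


def main():
    # Locate the Scala jar shipped alongside this Python package.
    jar_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "jars", "scala-lib.jar")

    # Forward the caller's script and arguments to spark-submit,
    # injecting --jars so clients never have to pass it themselves.
    cmd = ["spark-submit", "--jars", jar_path] + sys.argv[1:]
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()

Exposing this as a console-script entry point (e.g., a hypothetical my-lib-submit command) would let clients run my-lib-submit their_job.py without ever touching --jars.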
