
Referencing a Spark (Scala) library from a PySpark library

I will be building a Python library for PySpark clients. This library will be calling a Spark (Scala) library that I already built and have in production. As motivation (and, perhaps, a sanity check), the Python library they will be calling would look something like this:

from pyspark.sql import DataFrame as PyDataFrame

def process(python_data_frame):
    # Grab the SparkContext and SQLContext backing the incoming DataFrame.
    sc = python_data_frame.rdd.context
    sql_context = python_data_frame.sql_ctx

    # Call the Scala library through the JVM gateway, passing the underlying Java DataFrame.
    processed_scala_df = sc._jvm.com.mayonesa.ScalaClass.process(python_data_frame._jdf)

    # Wrap the resulting Java DataFrame back into a PySpark DataFrame.
    return PyDataFrame(processed_scala_df, sql_context)

I would like to make importing/using this library as painless as possible for my PySpark customers. How would I reference my Scala project as a dependency to/within this Python library? I would like to avoid them having to add options (i.e., --jars) to the spark-submit command.

--jars or --packages is the typical way to go with 3rd-party libraries (like yours).
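
For example, a client invocation might look something like the following (the jar path and the Maven coordinates com.mayonesa:scala-lib:1.0.0 are placeholders, not your actual artifact):

spark-submit --jars /path/to/scala-lib.jar client_job.py

or, if the Scala artifact is published to a Maven repository:

spark-submit --packages com.mayonesa:scala-lib:1.0.0 client_job.py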

If you want to make their experience of using your library less painful, you might want to wrap the spark-submit command, with all the extra parameters, in a wrapper script, which would definitely make it much easier to call.
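
A minimal sketch of such a wrapper in Python, assuming the Scala jar is bundled inside the Python package (the jars/scala-lib.jar location is a placeholder; adjust it to wherever your build actually places the jar):

import os
import subprocess
import sys


def main():
    # Locate the Scala jar shipped alongside this Python package.
    jar_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), "jars", "scala-lib.jar")

    # Forward the caller's script and arguments to spark-submit,
    # injecting --jars so clients never have to pass it themselves.
    cmd = ["spark-submit", "--jars", jar_path] + sys.argv[1:]
    sys.exit(subprocess.call(cmd))


if __name__ == "__main__":
    main()

Exposing this as a console-script entry point (e.g., a hypothetical my-lib-submit command) would let clients run my-lib-submit their_job.py without ever touching --jars.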
