
PySpark: execute a job in standalone mode with user-defined modules?

I have installed Spark on several machines to use them in standalone cluster mode, so I now have a set of machines that each have the same Spark build (Spark 2.4.0 built on Hadoop 2.7+).

I want to use this cluster for parallel data analysis, and my language of choice is Python, so I'm using PySpark rather than the Scala/Java Spark API. I have created some modules with the operations that process the data and put it into the form I want.

However, I don't want to manually copy all of these modules onto every machine, so I would like to know which options PySpark offers for shipping the dependencies, so that I can be sure the modules are present on every executor.

I have thought about virtual environments that would be activated and have the modules installed into them, but I don't know how to do that in Spark standalone mode; the YARN resource manager seems to offer this option, but I don't want to install YARN.

P.S. Note: some modules use data files such as .txt files and dynamic libraries such as .dll and .so files, and I want these to be shipped to the executors as well.

A good solution for distributing Spark and your modules is to use Docker Swarm (I hope you have some experience with Docker).

Take a look at this repository; it was very useful for me at the time: https://github.com/big-data-europe/docker-spark

It is a good base for distributing Spark, and on top of it you can build images containing your own modules. You create your personal Docker images, push them to Docker Hub, and then easily distribute them across your cluster with Docker Swarm.
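For example, here is a minimal sketch of a Dockerfile that extends one of the images from that repository; the base image tag, the directory layout, and the availability of pip inside the image are assumptions you would adapt to your own setup:

    # Base image from big-data-europe/docker-spark (tag assumed; use the one
    # matching your Spark 2.4.0 / Hadoop 2.7 build)
    FROM bde2020/spark-worker:2.4.0-hadoop2.7

    # Copy your own Python modules, data files (.txt) and native libraries (.so)
    COPY my_modules/ /app/my_modules/
    COPY data/ /app/data/
    COPY libs/ /app/libs/

    # Install the Python dependencies of your modules (assumes pip is available
    # in the base image; otherwise install Python/pip first)
    RUN pip install -r /app/my_modules/requirements.txt

    # Make the modules and native libraries visible to the Python workers
    ENV PYTHONPATH="/app/my_modules:${PYTHONPATH}"
    ENV LD_LIBRARY_PATH="/app/libs:${LD_LIBRARY_PATH}"

You build this image once, push it to Docker Hub, and then start the worker services (and, from a similar image, the master) with docker stack deploy or docker service create, so every node in the Swarm runs the same image and already has your modules, data files and native libraries in place.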
