
Apache Spark's worker python

After installing Apache Spark on 3 nodes on top of Hadoop, I encountered the following problems:
Problem 1 - Python version:
I had a problem setting the Python interpreter on the workers. This is the setting in the .bashrc file, and the same setting is in the spark-env.sh file:

alias python3='/usr/bin/python3'
export PYSPARK_PYTHON=/usr/bin/python3
export PYSPARK_DRIVER_PYTHON=/usr/bin/python3
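To double-check that these variables are actually picked up, a small job like the sketch below (the app name is just illustrative) can be submitted with spark-submit; it prints which interpreter the driver and the executors end up using:

import sys
from pyspark.sql import SparkSession

# Sketch only: the reported paths should match PYSPARK_DRIVER_PYTHON and
# PYSPARK_PYTHON above.
spark = SparkSession.builder.appName("python-check").getOrCreate()
print("driver python:", sys.executable)
executor_pythons = (
    spark.sparkContext.parallelize(range(4), 4)
    .map(lambda _: sys.executable)
    .distinct()
    .collect()
)
print("executor python(s):", executor_pythons)
spark.stop()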

In the Spark logs (yarn logs --applicationId <app_id>) I could see that everything was as expected:

export USER="hadoop"
export LOGNAME="hadoop"
export PYSPARK_PYTHON="python3"

Although I installed the pandas library (pip install pandas) on the master and worker nodes and made sure it was installed, I constantly received the following message when using the command /home/hadoop/spark/bin/spark-submit --master yarn --deploy-mode cluster sparksql_recommender_system_2.py:

ModuleNotFoundError: No module named 'pandas'

Surprisingly, this error occurred only in cluster mode; I didn't get it in client deploy mode.

The command which python returns /usr/bin/python, and the pandas library exists for that interpreter. After 2 days I couldn't find an answer on the web. By chance, I tried installing pandas using sudo, and it worked :).

sudo pip install pandas

However, what I expected was that Spark would use the Python in /usr/bin/python for the hadoop user, not for the root user. How can I fix it?
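For what it's worth, a quick check like the sketch below (run with the same /usr/bin/python3, once as the hadoop user and once after the sudo install) seems to show the difference: a plain pip install can land in the user site under ~/.local, which the containers started by YARN in cluster mode may not see, while sudo pip install goes to the system site-packages:

import site
import sys

# Sketch only: print which interpreter is running, where its user site is,
# and where pandas was actually imported from.
print("interpreter:", sys.executable)
print("user site:", site.getusersitepackages())
import pandas
print("pandas location:", pandas.__file__)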

Problem 2 - different behavior of VSCode SSH
I use the VSCode SSH extension to connect to a server on which I develop my code. When I do it from one host (PC) I can use spark-submit, but on my other PC I have to use the full path /home/hadoop/spark/bin/spark-submit. It is strange because I use VSCode SSH to the same server and files. Any idea how I can solve it?

Here's a great discussion on how to package things up so that your Python environment is transferred to the executors.

Create the environment:

conda create -y -n pyspark_conda_env -c conda-forge pyarrow pandas conda-pack
conda activate pyspark_conda_env
conda pack -f -o pyspark_conda_env.tar.gz

Ship it:

export PYSPARK_DRIVER_PYTHON=python  # Do not set in cluster modes.
export PYSPARK_PYTHON=./environment/bin/python
spark-submit --archives pyspark_conda_env.tar.gz#environment app.py
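For completeness, a minimal app.py along these lines (the names and numbers are just placeholders) would exercise the packed pandas/pyarrow, since the pandas UDF body runs inside each executor using the Python shipped in ./environment:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, pandas_udf

# Sketch only: a trivial pandas UDF that runs on the executors.
spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()

@pandas_udf("double")
def plus_one(v: pd.Series) -> pd.Series:
    return v + 1.0

df = spark.range(10).select(col("id").cast("double").alias("x"))
df.select(plus_one(col("x")).alias("x_plus_one")).show()
spark.stop()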

This does have the disadvantage of having to be shipped every time, but it's the safest and least-hassle way to do it. Installing everything on each node is 'faster' but comes with a higher management overhead, and I suggest avoiding it.

All that said... get off Pandas. Pandas does Python things (small data); Spark DataFrames do Spark things (big data). I hope it was just an illustrative example and you aren't going to use Pandas. (It's not bad; it's just made for small data, so use it for small data.) If you "have to" use it, look into Koalas, which provides a translation layer that lets you ask pandas-style questions of Spark data frames.
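If you do look into that, the API is roughly like this sketch (on Spark 3.2+ Koalas ships as pyspark.pandas; the data here is made up):

import pyspark.pandas as ps

# Sketch only: a pandas-style DataFrame backed by Spark, so the familiar-looking
# calls below run as distributed Spark jobs.
psdf = ps.DataFrame({"user": ["a", "b", "a"], "rating": [5, 3, 4]})
print(psdf.groupby("user")["rating"].mean())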
