
Unable to start a job using spark-submit via ssh (on EC2)

I set up spark on a single EC2 machine and, when I am connected to it, I am able to use spark either with jupyter or spark-submit, without any issue. Unfortunately, though, I am not able to use spark-submit via ssh.

So, to recap:

  • This works:

     ubuntu@ip-198-43-52-121:~$ spark-submit job.py
  • This does not work:

     ssh -i file.pem ubuntu@blablablba.compute.amazon.com "spark-submit job.py"

Initially, I kept getting the following error message over and over:

'java.io.IOException: Cannot run program "python": error=2, No such file or directory'
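(A quick way to see what the shell that runs the remote command actually has available is a one-liner like the one below; the key file and host are the same placeholders used elsewhere in this post:)

ssh -i file.pem ubuntu@blabla.compute.amazon.com 'echo $PATH; which python python3; echo PYSPARK_PYTHON=$PYSPARK_PYTHON'

The single quotes matter here: they make $PATH and $PYSPARK_PYTHON expand on the remote machine rather than locally.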

After having read many articles and posts about this issue, I thought the problem was due to some variables not having been set properly, so I added the following lines to the machine's .bashrc file:

export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7   # (this is where I unzipped the spark file)
export PATH=$SPARK_HOME/bin:$PATH
export PYTHONPATH=/usr/bin/python3
export PYSPARK_PYTHON=python3

(As the error message referenced python, I also tried adding the line "alias python=python3" to .bashrc, but nothing changed)

After all this, if I try to submit the spark job via ssh I get the following error message:

"command spark-submit not found". “找不到命令火花提交”。

As it looks like the system ignores all the environment variables when sending commands via SSH, I decided to source the machine's .bashrc file before trying to run the spark job. As I was not sure about the most appropriate way to send multiple commands via SSH, I tried all of the following:

ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file"


ssh -i file.pem ubuntu@blabla.compute.amazon.com << HERE
source .bashrc
spark-submit job.file
HERE 


ssh -i file.pem ubuntu@blabla.compute.amazon.com <<- HERE
source .bashrc
spark-submit job.file
HERE


(ssh -i file.pem ubuntu@blabla.compute.amazon.com "source .bashrc; spark-submit job.file")

All attempts worked with other commands like ls or mkdir, but not with source and spark-submit.
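(A detail worth knowing here: on a stock Ubuntu image, ~/.bashrc typically starts with a guard roughly like the one below, so sourcing it from a non-interactive ssh command returns before it ever reaches export lines added at the end of the file. The exact guard varies between images, so treat this as a sketch:)

# near the top of Ubuntu's default ~/.bashrc: if not running interactively, do nothing
case $- in
    *i*) ;;        # interactive shell: keep reading the file
      *) return;;  # non-interactive (e.g. ssh host "cmd"): stop here
esac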

I have also tried providing the full path to spark-submit, running the following line:

ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"

In this case too I get, once again, the following message:

'java.io.IOException: Cannot run program "python": error=2, No such file or directory'

How can I tell spark which python to use if SSH seems to ignore all environment variables, no matter how many times I set them?

It's worth mentioning that I only got into coding and data a bit more than a year ago, so I am really a newbie here and any help would be highly appreciated. The solution may be very simple, but I cannot get my head around it. Please help.

Thanks a lot in advance :)

The problem was indeed with the way I was expecting the shell to work (which was wrong).

My issue was solved by:

  1. Setting my variables in .profile instead of .bashrc
  2. Providing the full path to python (see the sketch below)

Now I can launch spark jobs via ssh.
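A sketch of what those two points can look like in practice. The Spark path is the one from earlier in this post; putting the interpreter path into PYSPARK_PYTHON is one way to satisfy point 2, and /usr/bin/python3 is only an assumed location (check it with "which python3"):

# in /home/ubuntu/.profile
export SPARK_HOME=/home/ubuntu/spark-3.0.1-bin-hadoop2.7
export PATH=$SPARK_HOME/bin:$PATH
export PYSPARK_PYTHON=/usr/bin/python3   # full path to the interpreter, not just "python3"

# then, from the local machine
ssh -i file.pem ubuntu@blabla.compute.amazon.com "spark-submit job.py"

If you would rather not touch the profile files, the interpreter can also be set per job, either inline in the remote command or through Spark's spark.pyspark.python configuration property:

ssh -i file.pem ubuntu@blabla.compute.amazon.com "PYSPARK_PYTHON=/usr/bin/python3 /home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit job.py"

ssh -i file.pem ubuntu@blabla.compute.amazon.com "/home/ubuntu/spark-3.0.1-bin-hadoop2.7/bin/spark-submit --conf spark.pyspark.python=/usr/bin/python3 job.py"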

I found the solution in the answer @VinkoVrsalovic gave to this post:

Why does an SSH remote command get fewer environment variables then when run manually?

Cheers
