
Install Spark on a YARN cluster

I am looking for a guide on how to install Spark on an existing virtual YARN cluster.

I have a YARN cluster consisting of two nodes, and I ran a MapReduce job which worked perfectly. I looked for the results in the logs and everything is working fine.

Now I need to add the Spark installation commands and configuration files to my Vagrantfile. I can't find a good guide; could someone give me a good link?

I used this guide for the YARN cluster:

http://www.alexjf.net/blog/distributed-systems/hadoop-yarn-installation-definitive-guide/#single-node-installation

Thanks in advance!

I don't know about Vagrant, but I have installed Spark on top of Hadoop 2.6 (referred to in the guide as post-YARN), and I hope this helps.

Installing Spark on an existing Hadoop cluster is really easy; you only need to install it on one machine. For that, download the build pre-built for your Hadoop version from the official website (I guess you could use the "without Hadoop" version, but then you need to point it to the Hadoop binaries on your system). Then decompress it:

tar -xvf spark-2.0.0-bin-hadoop2.x.tgz -C /opt
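If you still need to download the tarball, a hedged example follows, assuming Spark 2.0.0 pre-built for Hadoop 2.7 fetched from the Apache archive; pick the build that matches your Hadoop version:

# example download step; replace the version with the one matching your cluster
wget https://archive.apache.org/dist/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz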

Now you only need to set some environment variables. First, in your ~/.bashrc (or ~/.zshrc) you can set SPARK_HOME and add it to your PATH if you want:

export SPARK_HOME=/opt/spark-2.0.0-bin-hadoop2.x
export PATH=$PATH:$SPARK_HOME/bin

For these changes to take effect, you can run:

source ~/.bashrc
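As a quick sanity check (not from the original answer, just a common verification), you can confirm the binaries are on your PATH:

# should print the Spark version banner if SPARK_HOME and PATH are set correctly
spark-submit --version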

Second, you need to point Spark to your Hadoop configuration directories. To do this, set these two environment variables in $SPARK_HOME/conf/spark-env.sh:

export HADOOP_CONF_DIR=[your-hadoop-conf-dir usually $HADOOP_PREFIX/etc/hadoop]
export YARN_CONF_DIR=[your-yarn-conf-dir usually the same as the last variable]
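For a typical tarball installation, these might look something like the following; the paths are assumptions, so use wherever your Hadoop configuration actually lives:

# example values only; adjust to your actual Hadoop installation path
export HADOOP_CONF_DIR=/opt/hadoop-2.6.0/etc/hadoop
export YARN_CONF_DIR=/opt/hadoop-2.6.0/etc/hadoop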

If this file doesn't exist, you can copy $SPARK_HOME/conf/spark-env.sh.template and start from there.
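For example, one way to create it from the template:

cp $SPARK_HOME/conf/spark-env.sh.template $SPARK_HOME/conf/spark-env.sh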

Now, to start the shell in YARN mode, you can run:

spark-shell --master yarn --deploy-mode client

(You can't run the shell in cluster deploy mode.)
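As a small smoke test (not part of the original answer), you can pipe a tiny job into the shell to confirm that executors actually start on YARN:

# sums 1..100 on the cluster; prints 5050.0 if the YARN executors come up
echo 'sc.parallelize(1 to 100).sum()' | spark-shell --master yarn --deploy-mode client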

----------- Update -----------

I forgot to mention that you can also submit cluster jobs with this configuration, like this (thanks @JulianCienfuegos):

spark-submit --master yarn --deploy-mode cluster project-spark.py
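If you don't have your own script handy, here is a sketch using the SparkPi example that ships with the pre-built distribution; the jar path is an assumption based on the Spark 2.0.0 layout:

# computes an approximation of pi on the cluster; the last argument is the number of partitions
spark-submit --master yarn --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  $SPARK_HOME/examples/jars/spark-examples_2.11-2.0.0.jar 10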

This way you can't see the output in the terminal, and the command exits as soon as the job is submitted (not when it completes).
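To see the driver output of a cluster-mode job afterwards, you can pull the aggregated logs from YARN (assuming log aggregation is enabled; the application id below is a placeholder):

# replace the id with the one printed by spark-submit or shown in the ResourceManager UI
yarn logs -applicationId application_1234567890123_0001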

You can also use --deploy-mode client to see the output right there in your terminal, but only do this for testing, since the job gets canceled if the command is interrupted (e.g. you press Ctrl+C, or your session ends).
