How to run a PySpark job (with custom modules) on Amazon EMR?
I want to run a PySpark program that runs perfectly well on my (local) machine.

I have an Amazon Elastic MapReduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).

Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for maybe half a day now, to no avail. The best command I have found so far is:
/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
--py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py
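For context, a minimal layout for this setup would be a small `custom_module.py` shipped via `--py-files`, imported by the driver script. The file contents below are hypothetical illustrations (the `normalize` helper and the input path are made up, not taken from the question):

```python
# custom_module.py -- a hypothetical helper shipped to the cluster via --py-files
def normalize(word):
    """Lowercase and strip a token so different spellings map to one key."""
    return word.strip().lower()


# pyspark_program.py would then use it roughly like this (sketch only;
# it runs only on a machine with Spark installed):
#
#   from pyspark import SparkContext
#   import custom_module  # resolvable only if --py-files distributed the file
#
#   sc = SparkContext()
#   counts = (sc.textFile("s3://bucket/input.txt")
#               .flatMap(str.split)
#               .map(custom_module.normalize)
#               .countByValue())
```

The point of `--py-files` is precisely to copy such a module to the cluster and put it on each executor's Python path, which is why the import failure below is surprising.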
However, Python fails because it does not find custom_module.py. It seems to try to copy it, though:
INFO yarn.Client: Uploading resource s3://bucket/custom_module.py -> hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/custom_module.py
INFO s3n.S3NativeFileSystem: Opening 's3://bucket/custom_module.py' for reading
This looks like an awfully basic question, but the web is quite mute on this, including the official documentation (the Spark documentation seems to imply the command above).
This is a bug in Spark 1.3.0.

The workaround consists in defining SPARK_HOME for YARN, even though this should be unnecessary:
spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
--conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
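Putting the workaround together with the original command, the full invocation on the EMR master node would presumably look like the following (the `s3://bucket/...` paths are the placeholders from the question, not real locations):

```shell
/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
  --conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark \
  --py-files s3://bucket/custom_module.py \
  s3://bucket/pyspark_program.py
```

With SPARK_HOME set for both the YARN application master and the executors, the staged custom_module.py is found at import time.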