How to run a PySpark job (with custom modules) on Amazon EMR?
I want to run a PySpark program that runs perfectly well on my (local) machine.

I have an Amazon Elastic MapReduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).

Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for maybe half a day now, to no avail. The best command I have found so far is:
/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
--py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py
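For context, a minimal layout for this setup would be a small `custom_module.py` shipped via `--py-files`, imported by the driver script. The file contents below are hypothetical illustrations (the `normalize` helper and the input path are made up, not taken from the question):

```python
# custom_module.py -- a hypothetical helper shipped to the cluster via --py-files
def normalize(word):
    """Lowercase and strip a token so different spellings map to one key."""
    return word.strip().lower()


# pyspark_program.py would then use it roughly like this (sketch only;
# it runs only on a machine with Spark installed):
#
#   from pyspark import SparkContext
#   import custom_module  # resolvable only if --py-files distributed the file
#
#   sc = SparkContext()
#   counts = (sc.textFile("s3://bucket/input.txt")
#               .flatMap(str.split)
#               .map(custom_module.normalize)
#               .countByValue())
```

The point of `--py-files` is precisely to copy such a module to the cluster and put it on each executor's Python path, which is why the import failure below is surprising.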
However, Python fails because it does not find custom_module.py. It seems to try to copy it, though:
INFO yarn.Client: Uploading resource s3://bucket/custom_module.py -> hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/custom_module.py
INFO s3n.S3NativeFileSystem: Opening 's3://bucket/custom_module.py' for reading
This looks like an awfully basic question, but the web is quite mute on this, including the official documentation (the Spark documentation seems to imply the command above).
This is a bug in Spark 1.3.0.

The workaround consists in defining SPARK_HOME for YARN, even though this should be unnecessary:
spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
--conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
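Putting the workaround together with the original command, the full invocation on the EMR master node would presumably look like the following (the `s3://bucket/...` paths are the placeholders from the question, not real locations):

```shell
/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
  --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
  --conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark \
  --py-files s3://bucket/custom_module.py \
  s3://bucket/pyspark_program.py
```

With SPARK_HOME set for both the YARN application master and the executors, the staged custom_module.py is found at import time.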