
How to run a PySpark job (with custom modules) on Amazon EMR?

I want to run a PySpark program that runs perfectly well on my (local) machine.

I have an Amazon Elastic MapReduce cluster running, with all the necessary dependencies installed (Spark, Python modules from PyPI).

Now, how do I run a PySpark job that uses some custom modules? I have been trying many things for maybe half a day now, to no avail. The best command I have found so far is:

/home/hadoop/spark/bin/spark-submit --master yarn-cluster \
    --py-files s3://bucket/custom_module.py s3://bucket/pyspark_program.py 

However, Python fails because it does not find custom_module.py. It seems to try to copy it, though:

INFO yarn.Client: Uploading resource s3://bucket/custom_module.py -> hdfs://…:9000/user/hadoop/.sparkStaging/application_…_0001/custom_module.py

INFO s3n.S3NativeFileSystem: Opening 's3://bucket/custom_module.py' for reading
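
One way to narrow down an import failure like this is to log, from inside the job itself, whether the module is visible on the Python path of the process that runs it. The following is only a diagnostic sketch: the helper name describe_module is hypothetical, and it assumes a Python 3.4+ interpreter on the cluster (importlib.util.find_spec is not available on Python 2).

```python
import importlib.util
import sys


def describe_module(name):
    """Return a one-line report saying whether `name` is importable and from where."""
    # find_spec returns None when a top-level module cannot be located on sys.path.
    spec = importlib.util.find_spec(name)
    if spec is None:
        return "{}: NOT importable; sys.path = {}".format(name, sys.path)
    return "{}: importable from {}".format(name, spec.origin)


if __name__ == "__main__":
    # custom_module is the module name used in the question.
    print(describe_module("custom_module"))
```

Calling describe_module("custom_module") at the top of pyspark_program.py, before the real import, would show in the driver log whether the file copied to the staging directory actually ended up on sys.path.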

This looks like an awfully basic question, but the web is quite mute on this, including the official documentation (the Spark documentation seems to imply the command above).

This is a bug in Spark 1.3.0.

The workaround consists of defining SPARK_HOME for YARN, even though this should be unnecessary:

spark-submit … --conf spark.yarn.appMasterEnv.SPARK_HOME=/home/hadoop/spark \
               --conf spark.executorEnv.SPARK_HOME=/home/hadoop/spark …
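
To make the shape of the full command concrete, here is a small sketch that assembles the argument list in Python. The function name build_spark_submit is hypothetical, and the sketch assumes that the elided parts of the command above are the options from the question (--master yarn-cluster and --py-files); only flags that appear in this thread are used.

```python
import shlex


def build_spark_submit(program, py_files, spark_home="/home/hadoop/spark"):
    """Assemble a spark-submit argument list that includes the SPARK_HOME workaround."""
    args = [
        spark_home + "/bin/spark-submit",
        "--master", "yarn-cluster",
        # Workaround: define SPARK_HOME for the YARN application master and the executors.
        "--conf", "spark.yarn.appMasterEnv.SPARK_HOME=" + spark_home,
        "--conf", "spark.executorEnv.SPARK_HOME=" + spark_home,
    ]
    if py_files:
        # --py-files takes a single comma-separated list of files.
        args += ["--py-files", ",".join(py_files)]
    args.append(program)
    return args


if __name__ == "__main__":
    cmd = build_spark_submit(
        "s3://bucket/pyspark_program.py",
        ["s3://bucket/custom_module.py"],
    )
    # Print a shell-safe version of the command for copy and paste.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list keeps each option and its value together, and the returned list could be passed to subprocess.run unchanged if the job is launched from a script on the master node.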
