
How to install custom packages in an Amazon EMR bootstrap action in code?

I need to install some packages and binaries in an Amazon EMR bootstrap action, but I can't find any example that does this.

Basically, I want to install a Python package and have each Hadoop node use it to process the items in an S3 bucket. Here's a sample from boto:

from boto.emr.step import StreamingStep

step = StreamingStep(name='Image to grayscale using SimpleCV python package',
                     mapper='s3n://elasticmapreduce/samples/imageGrayScale.py',
                     reducer='aggregate',
                     input='s3n://elasticmapreduce/samples/input',
                     output='s3n://<my output bucket>/output')

I need to make it use the SimpleCV Python package, but I'm not sure where to specify this. If the package is not installed, how do I get it installed? Is there a way to avoid waiting for the installation to complete, for example by installing it somewhere once and just referencing the package?

boto provides a class for bootstrap actions: boto.emr.bootstrap_action.BootstrapAction.

Define it as shown below. Most of the code is from the boto example page.

import boto.emr
from boto.emr.bootstrap_action import BootstrapAction

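# bootstrap_action_args is a required parameter; pass None when the script takes no arguments
action = BootstrapAction(name="Bootstrap to add SimpleCV",
                         path="s3n://<my bucket uri>/bootstrap-simplecv.sh",
                         bootstrap_action_args=None)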

conn = boto.emr.connect_to_region('us-west-2')
jobid = conn.run_jobflow(name='My jobflow',
                         log_uri='s3://<my log uri>/jobflow_logs',
                         steps=[step],  # step defined elsewhere
                         bootstrap_actions=[action])

You also need to write the bootstrap script itself. If you need another version of Python then yes, it would save time to precompile it on an identical machine, tar it, put it in an S3 bucket, and then untar it during the bootstrap.

#!/bin/sh
# filename: bootstrap-simplecv.sh  (save it in an S3 bucket)
set -e -x

# -y answers the install prompt, since bootstrap actions run unattended
sudo apt-get install -y python-setuptools
sudo easy_install pip
sudo pip install -U SimpleCV

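For the precompiled route mentioned above, the packaging and upload could be sketched roughly as follows. This is only an illustration: the helper names `make_tarball` and `upload_tarball`, and the bucket and key names you would pass in, are hypothetical, and the upload assumes boto can find your AWS credentials.

```python
import os
import tarfile


def make_tarball(source_dir, tarball_path):
    """Pack source_dir into a gzip-compressed tarball and return its path."""
    with tarfile.open(tarball_path, "w:gz") as tar:
        # arcname keeps only the top-level directory name inside the archive
        top_level = os.path.basename(os.path.normpath(source_dir))
        tar.add(source_dir, arcname=top_level)
    return tarball_path


def upload_tarball(tarball_path, bucket_name, key_name):
    """Upload the tarball to an existing S3 bucket with boto."""
    import boto  # imported here so make_tarball works without boto installed

    bucket = boto.connect_s3().get_bucket(bucket_name)
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(tarball_path)
```

The bootstrap script would then download and untar that archive instead of running pip, for example with `hadoop fs -copyToLocal` or `aws s3 cp` (whichever the AMI provides) followed by `tar -xzf`.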
I think you can keep the EMR instances running from within boto so that the bootstrap only happens the first time in your session. Just be careful to shut them down before you log out so you don't get a surprise on your bill.
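One way to keep the cluster up, as suggested above, is the `keep_alive` flag of boto's `run_jobflow`, combined with `add_jobflow_steps` and `terminate_jobflow`. The sketch below is not taken from the original answer; the three wrapper function names are hypothetical, and `conn` is assumed to be an EMR connection such as the one returned by `boto.emr.connect_to_region`.

```python
def start_persistent_jobflow(conn, name, log_uri, bootstrap_actions):
    """Start a job flow that stays up after its steps finish.

    conn is a boto EMR connection, e.g. boto.emr.connect_to_region('us-west-2').
    Returns the job flow id.
    """
    return conn.run_jobflow(name=name,
                            log_uri=log_uri,
                            bootstrap_actions=bootstrap_actions,
                            keep_alive=True)  # keep the cluster when idle


def submit_step(conn, jobid, step):
    """Queue another step on the already-running job flow."""
    return conn.add_jobflow_steps(jobid, [step])


def shut_down(conn, jobid):
    """Terminate the job flow so it stops accruing charges."""
    conn.terminate_jobflow(jobid)
```

Because bootstrap actions run when the nodes are provisioned, steps submitted later to the same job flow should not trigger them again. Wrapping the work in `try`/`finally` with `shut_down` in the `finally` clause is one way to avoid forgetting the termination.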

