I have a long-running Hadoop Streaming job on Amazon EMR (15 nodes, >1.5 hours). The job fails at about the 75% completion mark. I am using Python for both the mapper and the reducer.
I have made the following optimization so that output is unbuffered:

import logging
import os
import sys
import time

# Reopen stdout/stderr unbuffered (Python 2) so nothing is lost in a buffer
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
sys.stderr = os.fdopen(sys.stderr.fileno(), 'w', 0)
logging.getLogger().setLevel(logging.INFO)
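For context, here is a minimal sketch of how my mapper is structured with that setup in place (the word-count logic is just a stand-in for my real processing):

import logging
import os
import sys

# Unbuffered streams (Python 2); stdout carries key/value pairs, stderr carries logs
sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
sys.stderr = os.fdopen(sys.stderr.fileno(), 'w', 0)

# Without a handler the root logger drops records in Python 2,
# so route them explicitly to stderr
logging.basicConfig(stream=sys.stderr, level=logging.INFO)
log = logging.getLogger('mapper')

def main():
    count = 0
    for line in sys.stdin:
        count += 1
        if count % 10000 == 0:
            log.info('processed %d lines', count)
            # Hadoop Streaming interprets this stderr line as a status
            # update, which also resets the task's inactivity timeout
            sys.stderr.write('reporter:status:processed %d lines\n' % count)
        for word in line.split():  # stand-in for the real mapper logic
            print '%s\t%d' % (word, 1)
    log.info('done after %d lines', count)

if __name__ == '__main__':
    main()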
Also, just after issuing log entries with the logging module, I added the following:
sys.stderr.flush()
time.sleep(30)  # give the log collector time to pick the message up
sys.exit(3)     # non-zero exit so the task attempt is marked failed
to try to catch errors, but to no avail: the Hadoop log files do not show my errors. :(
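Putting it together, my current error handling looks roughly like this (process_line is a placeholder for my real reducer logic):

import sys
import time
import traceback

def process_line(line):
    # placeholder for the real reducer logic
    pass

try:
    for line in sys.stdin:
        process_line(line)
except Exception:
    traceback.print_exc(file=sys.stderr)  # write the full traceback to stderr
    sys.stderr.flush()
    time.sleep(30)  # give the log collector time to pick the message up
    sys.exit(3)     # non-zero exit so the task attempt is marked failed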
How can I get Hadoop to log my messages without dropping any?
I'm not 100% sure about the Python side, but I know that when you use the EMR command-line interface you have to specify the logging URI in Amazon S3. For example:
./elastic-mapreduce --create --other-options --log-uri s3n://emr.test/logs
This is specified when the cluster is launched. Then, under the logs directory on S3, the following directories are created:
/jobflowid
    /daemons
    /jobs
    /nodes
    /steps
    /task-attempts
Under /steps you get a folder for each individual step, and each step's stderr, stdout, and controller output are written there.
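If you want to pull those logs programmatically rather than browse them, something along these lines should work — a sketch using the boto library, assuming the bucket and prefix from the example above and a placeholder job flow id:

import boto

conn = boto.connect_s3()
bucket = conn.get_bucket('emr.test')

# List everything the job flow wrote under /steps;
# j-XXXXXXXXXXXX is a placeholder job flow id
for key in bucket.list(prefix='logs/j-XXXXXXXXXXXX/steps/'):
    print key.name

# Download one step's stderr for inspection
key = bucket.get_key('logs/j-XXXXXXXXXXXX/steps/1/stderr')
if key is not None:
    key.get_contents_to_filename('step-1-stderr.log')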