
Ensuring logs on Hadoop EMR

I have a long-running Hadoop streaming job on Amazon EMR (15 nodes, >1.5 hours). The job fails at about 75% completion. I am using Python for both the mapper and the reducer.

I have made the following change to keep stdout and stderr unbuffered and to enable INFO-level logging:

    import logging, os, sys

    # Reopen stdout/stderr unbuffered (Python 2) so output is not lost if the task dies
    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
    sys.stderr = os.fdopen(sys.stderr.fileno(), 'w', 0)

    logging.getLogger().setLevel(logging.INFO)
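
For context, here is a stripped-down sketch of how a streaming mapper built around this setup might look; the tab-separated input handling and the 'mapper' logger name are purely illustrative, not my actual job:

    import logging
    import os
    import sys

    # Reopen stdout/stderr unbuffered (Python 2) so lines reach the task attempt
    # logs immediately instead of sitting in a buffer when the task is killed.
    sys.stdout = os.fdopen(sys.stdout.fileno(), 'w', 0)
    sys.stderr = os.fdopen(sys.stderr.fileno(), 'w', 0)

    # Hadoop streaming captures each task attempt's stderr, so send log records there.
    logging.basicConfig(stream=sys.stderr, level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    log = logging.getLogger('mapper')

    def main():
        for line_no, line in enumerate(sys.stdin, 1):
            try:
                key, value = line.rstrip('\n').split('\t', 1)
                sys.stdout.write('%s\t%s\n' % (key, value))
            except ValueError:
                # This message should end up in the task attempt's stderr log.
                log.warning('skipping malformed line %d: %r', line_no, line[:100])

    if __name__ == '__main__':
        main()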

I also added the following just after issuing log entries with the logging module:

    sys.stderr.flush()
    time.sleep(30)
    sys.exit(3)

to try to catch errors, to no avail: the Hadoop log files do not show my errors :(
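
In context, that pattern sits roughly like this (the process function below is a stand-in for the real per-record work, not my actual code):

    import logging
    import sys
    import time

    logging.basicConfig(stream=sys.stderr, level=logging.INFO)

    def process(record):
        # Stand-in for the real per-record work; may raise on bad input.
        return record.upper()

    for record in sys.stdin:
        try:
            sys.stdout.write(process(record))
        except Exception:
            # Log the traceback, flush it, wait, then fail the task attempt.
            logging.exception('failed while processing %r', record)
            sys.stderr.flush()
            time.sleep(30)
            sys.exit(3)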

How can I get Hadoop to log my messages and not drop any???

I'm not 100% sure about the Python side, but I know that when using the EMR command line interface you have to specify the logging URI in Amazon S3.

For example:

./elastic-mapreduce --create --other-options --log-uri s3n://emr.test/logs

This is specified when the cluster is launched. Then, under the logs directory on S3, the following directories are created:

/jobflowid
   /daemons
   /jobs
   /nodes
   /steps
   /task-attempts

Under /steps you get a folder for each individual job, and below this the job's stderr, stdout, and controller output are written.
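
As a rough sketch of pulling one of those step logs back down for inspection, assuming the boto 2 S3 API; the bucket matches the example --log-uri above, while the jobflow id and step number are placeholders:

    import sys
    import boto

    # Placeholders: substitute the real jobflow id and step number.
    bucket_name = 'emr.test'
    log_key = 'logs/j-XXXXXXXXXXXXX/steps/1/stderr'

    conn = boto.connect_s3()                 # credentials from env vars or ~/.boto
    bucket = conn.get_bucket(bucket_name)
    key = bucket.get_key(log_key)
    if key is not None:
        sys.stdout.write(key.get_contents_as_string())
    else:
        sys.stderr.write('log not written yet: %s\n' % log_key)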
