mrjob: setup logging on EMR
I am trying to run hadoop on EMR with mrjob and cannot figure out how to set up logging (user-generated logs in the map/reduce steps) so that I can access the logs after the cluster terminates.

I have tried setting up logging with the logging module, with print, and with sys.stderr.write(), but with no luck so far. The only option that works for me is writing the logs to a file, then SSHing into the machine and reading it, which is cumbersome. I would like my logs to go to stderr/stdout/syslog and be collected automatically to S3, so I can view them after the cluster terminates.

Here is the word_freq example with logging:
"""The classic MapReduce job: count the frequency of words.
"""
from mrjob.job import MRJob
import re
import logging
import logging.handlers
import sys
WORD_RE = re.compile(r"[\w']+")
class MRWordFreqCount(MRJob):
def mapper_init(self):
self.logger = logging.getLogger()
self.logger.setLevel(logging.INFO)
self.logger.addHandler(logging.FileHandler("/tmp/mr.log"))
self.logger.addHandler(logging.StreamHandler())
self.logger.addHandler(logging.StreamHandler(sys.stdout))
self.logger.addHandler(logging.handlers.SysLogHandler())
def mapper(self, _, line):
self.logger.info("Test logging: %s", line)
sys.stderr.write("Test stderr: %s\n" % line)
print "Test print: %s" % line
for word in WORD_RE.findall(line):
yield (word.lower(), 1)
def combiner(self, word, counts):
yield (word, sum(counts))
def reducer(self, word, counts):
yield (word, sum(counts))
if __name__ == '__main__':
MRWordFreqCount.run()
Of all the options, the only ones that actually work are writing directly to stderr (sys.stderr.write) or using a logger with a StreamHandler pointed at stderr.

The logs can be retrieved later, after the job finishes (successfully or with an error), from:

[s3_log_uri]/[jobflow-ID]/task-attempts/[job-ID]/[attempt-ID]/stderr

Be sure to keep the logs in your runners.emr.cleanup configuration.
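To make the retained logs concrete, a minimal mrjob.conf sketch could look like the following. The exact option values depend on your mrjob version, and the bucket path is hypothetical; check the runner-options documentation before relying on it:

```yaml
# mrjob.conf -- hedged sketch, not a verified configuration
runners:
  emr:
    # keep logs (and other intermediate output) after the job ends
    cleanup: NONE
    cleanup_on_failure: NONE
    # where EMR writes the job flow's logs on S3 (hypothetical bucket)
    s3_log_uri: s3://my-bucket/logs/
```

With cleanup disabled, the task-attempt stderr files remain under the s3_log_uri path described above.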
Here is an example of logging to stdout (Python 3):
from mrjob.job import MRJob
from mrjob.step import MRStep
from mrjob.util import log_to_stream
import re
import logging

log = logging.getLogger(__name__)

WORD_RE = re.compile(r'[\w]+')


class MostUsedWords(MRJob):

    @classmethod
    def set_up_logging(cls, quiet=False, verbose=False, stream=None):
        log_to_stream(name='mrjob', debug=verbose, stream=stream)
        log_to_stream(name='__main__', debug=verbose, stream=stream)

    def steps(self):
        return [
            MRStep(mapper=self.mapper_get_words,
                   combiner=self.combiner_get_words,
                   reducer=self.reduce_get_words),
            MRStep(reducer=self.reducer_find_max)
        ]

    def mapper_get_words(self, _, line):
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)

    def combiner_get_words(self, word, counts):
        yield (word, sum(counts))

    def reduce_get_words(self, word, counts):
        # materialize the generator once, so logging it does not consume it
        counts = list(counts)
        log.info(word + "\t" + str(counts))
        yield None, (sum(counts), word)

    def reducer_find_max(self, key, value):
        # value is an iterable of (count, word) tuples
        yield max(value)


if __name__ == '__main__':
    MostUsedWords.run()
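The stderr pattern both examples rely on can be seen in isolation. This is a plain-Python sketch (no mrjob required); the helper name make_stderr_logger is mine, not part of any library:

```python
import logging
import sys


def make_stderr_logger(name="job", stream=None):
    """Return a logger that writes INFO records to stderr (or a given stream)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    log = make_stderr_logger()
    # on EMR, anything a task writes to stderr ends up in that task
    # attempt's stderr file, which is what gets copied to S3
    log.info("Test logging: %s", "hello")
```

Because the handler targets stderr rather than stdout, the log lines do not get mixed into the job's key/value output stream.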