python logging performance comparison and options

I am researching high performance logging in Python and so far have been disappointed by the performance of the Python standard logging module - but there seem to be no alternatives. Below is a piece of code to performance test 4 different ways of logging:

import logging
import timeit
import time
import datetime
from logutils.queue import QueueListener, QueueHandler
import Queue
import threading

tmpq = Queue.Queue()

def std_manual_threading():
    start = datetime.datetime.now()
    logger = logging.getLogger()
    hdlr = logging.FileHandler('std_manual.out', 'w')
    logger.addHandler(hdlr)
    logger.setLevel(logging.DEBUG)
    def logger_thread(f):
        while True:
            # block until a message (or the None sentinel) arrives
            item = tmpq.get()
            if item is None:
                break
            logging.info(item)
    f = open('manual.out', 'w')
    lt = threading.Thread(target=logger_thread, args=(f,))
    lt.start()
    for i in range(100000):
        tmpq.put("msg:%d" % i)
    tmpq.put(None)
    lt.join()
    print datetime.datetime.now() - start

def nonstd_manual_threading():
    start = datetime.datetime.now()
    def logger_thread(f):
        while True:
            # block until a message (or the None sentinel) arrives
            item = tmpq.get()
            if item is None:
                break
            f.write(item + "\n")
    f = open('manual.out', 'w')
    lt = threading.Thread(target=logger_thread, args=(f,))
    lt.start()
    for i in range(100000):
        tmpq.put("msg:%d" % i)
    tmpq.put(None)
    lt.join()
    print datetime.datetime.now() - start


def std_logging_queue_handler():
    start = datetime.datetime.now()
    q = Queue.Queue(-1)

    # the listener drains the queue in its own thread and writes via the file handler
    hdlr = logging.FileHandler('qtest.out', 'w')
    ql = QueueListener(q, hdlr)

    # the root logger only enqueues records through the QueueHandler
    root = logging.getLogger()
    root.setLevel(logging.DEBUG)
    qh = QueueHandler(q)
    root.addHandler(qh)

    ql.start()

    for i in range(100000):
        logging.info("msg:%d" % i)
    ql.stop()
    print datetime.datetime.now() - start

def std_logging_single_thread():
    start = datetime.datetime.now()
    logger = logging.getLogger()
    hdlr = logging.FileHandler('test.out', 'w')
    logger.addHandler(hdlr)
    logger.setLevel(logging.DEBUG)
    for i in range(100000):
        logging.info("msg:%d" % i)
    print datetime.datetime.now() - start

if __name__ == "__main__":
    """
    Conclusion: std logging is about 3 times slower than a plain file write, so for 100K
    lines the simple file write takes ~1 sec while std logging takes ~3. Introducing a
    separate thread adds overhead that pushes it to ~4, and using QueueListener/QueueHandler
    (with the thread-sleeping enhancement) goes to ~5, probably because log records are
    being inserted into the queue.
    """
    print "Testing"
    #std_logging_single_thread() # 3.4
    std_logging_queue_handler() # 7, 6, 7 (5 seconds with sleep optimization)
    #nonstd_manual_threading() # 1.08
    #std_manual_threading() # 4.3
1. The nonstd_manual_threading option works best since there is no overhead from the logging module, but you obviously miss out on a lot of features such as formatters, filters and the nice interface.
2. std_logging in a single thread is the next best option, but still about 3 times slower than nonstd_manual_threading.
3. The std_manual_threading option dumps messages into a thread-safe queue and uses the standard logging module in a separate thread. That comes out about 25% slower than option 2, probably due to context switching costs.
4. Finally, the option using logutils's QueueHandler comes out the most expensive. I tweaked the code of logutils/queue.py's _monitor method to sleep for 10 ms after processing 500 messages, as long as there are fewer than 100K messages in the queue (a sketch of an equivalent tweak follows this list). That brings the runtime down from 7 seconds to 5 seconds, probably by avoiding context switching costs.
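
For reference, the tweak mentioned in point 4 was an edit inside logutils/queue.py's _monitor loop; below is a rough, behaviourally similar sketch written as a QueueListener subclass instead of an in-place patch. The class name is mine; the 500-record batch, the 100K queue threshold and the 10 ms sleep are the numbers quoted above.

import time
from logutils.queue import QueueListener

class SleepyQueueListener(QueueListener):
    """Back off briefly every 500 records while the queue is not heavily
    backed up, to reduce context switching between the producer thread
    and the listener thread."""
    def __init__(self, queue, *handlers):
        QueueListener.__init__(self, queue, *handlers)
        self._handled = 0

    def handle(self, record):
        QueueListener.handle(self, record)
        self._handled += 1
        if self._handled % 500 == 0 and self.queue.qsize() < 100000:
            time.sleep(0.01)  # 10 ms

It would be used in std_logging_queue_handler() in place of the plain QueueListener.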

My question is, why is there so much performance overhead with the logging module, and are there any alternatives? For a performance-sensitive app, does it even make sense to use the logging module?

PS: I have profiled the different scenarios and it seems like LogRecord creation is expensive.
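
For anyone who wants to reproduce that profile, a run along these lines (against the std_logging_single_thread function from the benchmark above) is enough to show where the time goes:

import cProfile
import pstats

# profile the single-threaded stdlib-logging benchmark and print the top
# cumulative-time entries; LogRecord creation shows up near the top
cProfile.run('std_logging_single_thread()', 'logging.prof')
pstats.Stats('logging.prof').sort_stats('cumulative').print_stats(15)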

The stdlib logging package provides a lot of flexibility and functionality for developers / devops / support staff, and that flexibility comes at some cost, obviously. If the need for performance trumps the need for flexibility, you need to go with something else. Did you take the steps to optimise described in the docs? A typical logging call takes of the order of tens of microseconds on reasonable hardware, which hardly seems excessive. However, logging in tight loops is seldom advisable, if only because the amount of info generated might take too much time to wade through.
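
To make that concrete, the Optimization section of the logging docs describes module-level flags for skipping record fields you never format, plus isEnabledFor() guards around expensive calls. A minimal sketch is below; the expensive_state_dump function is just a made-up stand-in for some costly computation.

import logging

logger = logging.getLogger(__name__)

# skip collecting record fields that never appear in the format string
# (flags described in the Optimization section of the logging docs)
logging.logThreads = 0          # don't record thread name / ident
logging.logProcesses = 0        # don't record process id
logging.logMultiprocessing = 0  # don't record the multiprocessing process name

def expensive_state_dump():
    # stand-in for something genuinely costly to compute
    return repr(list(range(1000)))

# guard costly message construction so it is skipped when DEBUG is off
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("state dump: %s", expensive_state_dump())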

The code to find the caller can be quite expensive, but is needed if you want e.g. the filename and line number where the logging call was made.
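
If your format string never uses the caller fields, that lookup can be switched off; the docs' Optimization section mentions the module-level _srcfile attribute for this. A sketch, relying on that documented but underscore-prefixed hook:

import logging

# avoid the sys._getframe() walk that locates the calling file and line;
# only safe if the format string never uses %(pathname)s, %(filename)s,
# %(module)s, %(funcName)s or %(lineno)d
logging._srcfile = None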

QueueHandler is intended for scenarios where the logging I/O will take significant time and can't be done in-band. For example, a web application whose logs need to be sent by email to site administrators cannot risk using SMTPHandler directly, because the email handshake can be slow.
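
For the email example, the shape of the solution looks roughly like this. The handler classes are from the stdlib; QueueHandler/QueueListener come from logutils as in the question's code (they live in logging.handlers from Python 3.2 on); the mail host and addresses are placeholders.

import logging
import logging.handlers
import Queue  # 'queue' on Python 3
from logutils.queue import QueueHandler, QueueListener

q = Queue.Queue(-1)

# request threads only enqueue records - cheap and non-blocking
root = logging.getLogger()
root.setLevel(logging.INFO)
root.addHandler(QueueHandler(q))

# the slow SMTP handshake happens in the listener's background thread
smtp = logging.handlers.SMTPHandler(
    mailhost="localhost",            # placeholder mail host
    fromaddr="app@example.com",      # placeholder addresses
    toaddrs=["admin@example.com"],
    subject="Application log")
smtp.setLevel(logging.ERROR)

listener = QueueListener(q, smtp)
listener.start()
# ... application runs and logs as usual ...
listener.stop()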

Don't forget that thread context switching in Python is slow. Did you try SocketHandler? There is a suitable starting point in the docs for a separate receiver process that does the actual I/O to file, email etc. So your process is only doing socket I/O and not doing context switches just for logging. And using domain sockets or UDP might be faster still, though the latter is of course lossy.
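
A sketch of the sending side with SocketHandler; the receiving side would be something like the socket server from the logging cookbook, which unpickles the records and passes them to local handlers.

import logging
import logging.handlers

# each logging call just pickles the LogRecord and writes it to the socket;
# formatting and file/email I/O happen in the separate receiver process
sock_handler = logging.handlers.SocketHandler(
    "localhost", logging.handlers.DEFAULT_TCP_LOGGING_PORT)

root = logging.getLogger()
root.setLevel(logging.DEBUG)
root.addHandler(sock_handler)

logging.info("msg:%d", 42)

# DatagramHandler("localhost", logging.handlers.DEFAULT_UDP_LOGGING_PORT)
# is the UDP variant mentioned above - faster, but records can be dropped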

There are other ways to optimise. For example, standard handlers in logging do locking around emit(), for thread safety - if in a specific scenario under your control there is no contention for the handler, you could have a handler subclass that no-ops the lock acquisition and release. And so on.
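
One way to sketch such a subclass, assuming you can guarantee that only a single thread ever emits through the handler:

import logging

class UnlockedFileHandler(logging.FileHandler):
    """FileHandler that skips per-record locking.

    Only safe when a single thread writes through this handler, e.g. a
    dedicated logging thread or a single-threaded process.
    """
    def createLock(self):
        # Handler.acquire()/release() are no-ops when self.lock is None
        self.lock = None

# usage: logging.getLogger().addHandler(UnlockedFileHandler('fast.out', 'w'))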

If you want a better answer, try to describe your problem in more detail: why do you need to log such a huge number of messages? Logging was designed to record important information, especially warnings and errors, not every line you execute.

If logging takes more than 1% of your processing time, you are probably using it wrongly, and that's not logging's fault.

Second, related to performance: do not build the message before sending it to the logging module (pass the format string and the params as separate arguments instead of pre-applying format % params). This is because logging does this for you, but much faster.
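
In other words, let logging do the interpolation only when a record is actually going to be emitted:

import logging

logging.basicConfig(level=logging.WARNING)

i = 42

# slower: the string is built unconditionally, even though INFO is
# filtered out by the WARNING level here
logging.info("msg:%d" % i)

# faster: the format string and argument are stored on the LogRecord and
# only merged if some handler actually emits the record
logging.info("msg:%d", i)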

Python is not truly multi-threaded in the traditional sense. Whenever a thread is executing it has to own the GIL (global interpreter lock). "Threads" yield whenever they call into the system or have to wait on I/O. This allows the interpreter to run other Python "threads". This equates to asynchronous I/O.

Regardless of whether the result of the logging message is used or dropped, all of the work to evaluate the arguments for the logging message is done, as mentioned in other responses. However, what is missed (and where the multi-threaded part of your question comes in) is that while writing a large amount to disk may be slow, modern computers have many cores, so the process of writing the output to the file will be farmed out to another core while the interpreter moves on to another Python "thread". The operating system will complete the asynchronous disk write, and little to no time will be lost to the actual disk write.

As long as the interpreter always has another thread to switch to, virtually no time will be lost to the writes. The interpreter will only actually lose time if all Python "threads" are blocked on I/O, which is unlikely unless you are really swamping your disk.
