Why is this method using putchar_unlocked slower than printf and cout for printing strings?

Question

I'm studying ways to speed up my code for programming competitions, starting with faster input and output processing.

I'm currently using the thread-unsafe putchar_unlocked function to print in some tests. I believed that, if used well, this function would be faster than cout and printf for some data types because it does no locking.

I implemented a function to print strings this way (very simple, from my point of view):

void write_str(char s[], int n){
    int i;
    for(i=0;i<n;i++)
        putchar_unlocked(s[i]);
}

I tested with a string of size n containing exactly n characters.
But it is the slowest of the three, as the graph of number of output writes versus time in seconds shows.

Why is it the slowest?

Answer 1

Assuming the time measurements for up to about 1,000,000 characters are below a measurement threshold, and the writes to std::cout and stdout use a bulk-write form (e.g. std::cout.write(str, size)), I'd guess that putchar_unlocked() spends most of its time updating some part of the data structures in addition to putting the character. The bulk writes would copy the data into a buffer in bulk (e.g., using memcpy()) and update the data structures internally just once.

That is, the code would look something like this (this is pseudocode, i.e., just roughly showing what's going on; the real code would be at least slightly more complicated):

int putchar_unlocked(int c) {
    // store the character and advance the put pointer
    *stdout->put_pointer++ = c;
    if (stdout->put_pointer != stdout->buffer_end) {
        return c;  // common case: still room in the buffer
    }
    // buffer full: flush it to the file descriptor with one system call
    int rc = write(stdout->fd, stdout->buffer_begin, stdout->put_pointer - stdout->buffer_begin);
    // ignore partial writes
    stdout->put_pointer = stdout->buffer_begin;
    return rc == stdout->buffer_size? c: EOF;
}

The bulk version of the code instead does something along these lines (using C++ notation, as that is easier for me as a C++ developer; again, this is pseudocode):

int std::streambuf::write(char const* s, std::streamsize n) {
    std::lock_guard<std::mutex> guard(this->mutex);
    // copy as much as fits into the remaining buffer space
    std::streamsize b = std::min(n, this->epptr() - this->pptr());
    memcpy(this->pptr(), s, b);
    this->pbump(b);
    bool success = true;
    if (this->pptr() == this->epptr()) {
        // buffer full: flush it with one system call
        success = this->epptr() - this->pbase()
            == write(this->fd, this->pbase(), this->epptr() - this->pbase());
        // also ignoring partial writes
        this->setp(this->pbase(), this->epptr());
        // copy the rest of the input into the now-empty buffer
        memcpy(this->pptr(), s + b, n - b);
        this->pbump(n - b);
    }
    return success? n: -1;
}

The second code may look a bit more complicated, but it is only executed once for 30 characters. A lot of the checking is moved out of the interesting bit. Even if there is some locking done, it is locking an uncontended mutex and will not inhibit the processing much.

Especially without any profiling, the loop using putchar_unlocked() will not be optimized much. In particular, the code won't get vectorized, which costs an immediate factor of at least about 3, and probably closer to 16, on the actual loop. The cost of the lock quickly diminishes.

BTW, just to create a reasonably level playing field: aside from optimizing, you should also call std::ios_base::sync_with_stdio(false) when using C++ standard stream objects.

Answer 2

Choosing the fastest way to output strings depends on the platform, operating system, compiler settings and runtime library in use, but some generalizations may help in understanding what to select.

First, consider that the operating system may have a means of displaying whole strings rather than characters one at a time. If so, looping through a system call for character output one at a time would naturally incur overhead for every call to the system, as opposed to the overhead of one system call processing a character array.

That's basically what you're encountering: the overhead of a system call.

The performance gain of putchar_unlocked over putchar may be considerable, but only between those two functions. Further, putchar_unlocked is not available everywhere: it is a POSIX function rather than standard C, so Linux and Mac OS X runtimes provide it, while the Microsoft C runtime offers _putchar_nolock instead.

That said, locked or unlocked, there would still be per-character overhead that a system call processing the entire character array can eliminate, and this applies to output to files or other devices, not just the console.
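One way to avoid paying that per-character cost, sketched here under the assumption that the output fits a simple append-then-flush pattern, is to collect characters in your own buffer and pass it on in a single call. The OutBuffer type and its members are hypothetical names, not part of any library.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical fixed-size output buffer: put() only stores a byte;
// the underlying stream is touched once per full buffer (or on flush).
struct OutBuffer {
    static constexpr std::size_t kCapacity = 1 << 16;
    char data[kCapacity];
    std::size_t used = 0;
    std::FILE* out;

    explicit OutBuffer(std::FILE* stream) : out(stream) {}

    // Hand everything collected so far to the stream in one call.
    // Returns false if the stream accepted fewer bytes than requested.
    bool flush() {
        std::size_t written = std::fwrite(data, 1, used, out);
        bool ok = (written == used);
        used = 0;
        return ok;
    }

    // Store one byte, flushing first if the buffer is already full.
    bool put(char c) {
        if (used == kCapacity && !flush()) {
            return false;
        }
        data[used++] = c;
        return true;
    }
};
```

The caller is responsible for a final flush() before the program exits; anything still sitting in the buffer at that point is otherwise lost.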

Answer 3

My personal guess is that printf() works in chunks, and only has to cross the app/kernel boundary occasionally, once for each chunk.

putchar_unlocked() does it for every byte written.
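How often a stdio stream actually crosses into the kernel depends on its buffering mode, which can be requested explicitly. A minimal sketch, assuming the call is made before any other operation on the stream (the helper name use_full_buffering is hypothetical):

```cpp
#include <cstddef>
#include <cstdio>

// Ask the library to collect output in a buffer of `size` bytes that it
// allocates itself, so data is passed on in chunks rather than per call.
// Must run before any other operation on the stream.
// Returns true on success.
bool use_full_buffering(std::FILE* stream, std::size_t size) {
    return std::setvbuf(stream, nullptr, _IOFBF, size) == 0;
}
```

With something like use_full_buffering(stdout, 1 << 16) at the start of the program, the guess above can be checked by timing the per-character loop again.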

3 answers

Answer 1: score 3, posted 2015-09-19 20:39:17
Answer 2: score 2, accepted, posted 2015-09-19 20:31:51
Answer 3: score 1, posted 2015-09-20 02:10:59
