Why is this method using putchar_unlocked slower than printf and cout for printing strings?

Question

I'm studying ways to speed up my code for programming competitions, starting with faster input and output processing.

I'm currently using the thread-unsafe putchar_unlocked function to print some test output. I believed that, if used well, this function would be faster than cout and printf for some data types because it does no locking.

I implemented a function to print strings this way (very simple, in my view):

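#include <stdio.h>

/* print the first n characters of s, one putchar_unlocked call per character */
void write_str(char s[], int n){
    int i;
    for(i=0;i<n;i++)
        putchar_unlocked(s[i]);
}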

I tested with a string of size n and exactly n characters.
But it is the slowest of the three, as can be seen in the graph of number of output writes versus time in seconds.

Why is it the slowest?

Answer 1

Assuming the time measurements for up to about 1,000,000 characters are below a measurement threshold and the writes to std::cout and stdout are made using some form of bulk write (e.g. std::cout.write(str, size)), I'd guess that putchar_unlocked() spends most of its time updating some part of the stream's data structures in addition to storing the character. The bulk writes would instead copy the data into a buffer in one go (e.g. using memcpy()) and update the data structures just once.

That is, the code would look something like this (this is pidgin code, i.e. it just roughly shows what's going on; the real code would be at least slightly more complicated):

int putchar_unlocked(int c) {
    // store the character and advance the put pointer
    *stdout->put_pointer++ = c;
    // common case: there is still room in the buffer
    if (stdout->put_pointer != stdout->buffer_end) {
        return c;
    }
    // buffer full: hand it to the OS and start over
    int rc = write(stdout->fd, stdout->buffer_begin, stdout->put_pointer - stdout->buffer_begin);
    // ignore partial writes
    stdout->put_pointer = stdout->buffer_begin;
    return rc == stdout->buffer_size? c: EOF;
}

The bulk version of the code instead does something along the lines of this (using C++ notation, as that is easier for me as a C++ developer; again, this is pidgin code):

int std::streambuf::write(char const* s, std::streamsize n) {
    std::lock_guard<std::mutex> guard(this->mutex);
    // copy as much as fits into the remaining buffer space
    std::streamsize b = std::min(n, this->epptr() - this->pptr());
    memcpy(this->pptr(), s, b);
    this->pbump(b);
    bool success = true;
    if (this->pptr() == this->epptr()) {
        // buffer full: flush it, succeeding only if everything was written
        success = this->epptr() - this->pbase()
            == write(this->fd, this->pbase(), this->epptr() - this->pbase());
        // also ignoring partial writes
        this->setp(this->pbase(), this->epptr());
        // copy the rest of the input into the now-empty buffer
        memcpy(this->pptr(), s + b, n - b);
        this->pbump(n - b);
    }
    return success? n: -1;
}

The second version may look a bit more complicated, but it is executed only once for 30 characters rather than once per character. A lot of the checking is moved out of the interesting bit. Even if some locking is done, it locks an uncontended mutex and will not slow the processing much.

Especially when not doing any profiling, the loop using putchar_unlocked() will not be optimized much. In particular, the code won't get vectorized, which costs an immediate factor of at least about 3, and probably closer to 16, on the actual loop. The cost of the lock quickly becomes negligible by comparison.

BTW, just to create a reasonably level playing field: aside from enabling optimization, you should also call std::ios_base::sync_with_stdio(false) when using the C++ standard stream objects.
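To make the difference in bookkeeping concrete, here is a minimal sketch of a toy stream that counts how often the "is the buffer full?" check runs on each path. Everything in it (ToyStream, put_char, put_bulk, the counters) is hypothetical and only models the idea; the "device" is a std::string rather than a file descriptor, and no real library is implemented this way.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical toy stream: the "device" is a std::string standing in for
// write(fd, ...), so the output and the amount of bookkeeping can be inspected.
struct ToyStream {
    std::vector<char> buffer;
    std::size_t used = 0;         // bytes currently held in the buffer
    std::size_t full_checks = 0;  // how often the "buffer full?" test ran
    std::size_t flushes = 0;      // how often the buffer went to the device
    std::string device;

    // Assumption of this sketch: the capacity must be greater than zero.
    explicit ToyStream(std::size_t capacity) : buffer(capacity) {
        assert(capacity > 0);
    }

    void flush() {
        if (used == 0) return;
        device.append(buffer.data(), used);
        used = 0;
        ++flushes;
    }

    // Per-character path: one store, one bump and one check for every byte.
    void put_char(char c) {
        buffer[used++] = c;
        ++full_checks;
        if (used == buffer.size()) flush();
    }

    // Bulk path: memcpy as much as fits, then check once per chunk.
    void put_bulk(const char* s, std::size_t n) {
        while (n > 0) {
            std::size_t chunk = std::min(n, buffer.size() - used);
            std::memcpy(buffer.data() + used, s, chunk);
            used += chunk;
            s += chunk;
            n -= chunk;
            ++full_checks;
            if (used == buffer.size()) flush();
        }
    }
};
```

With an 8-byte buffer and a 21-character string, the per-character path runs the check 21 times while the bulk path runs it 3 times, and both produce the same output. Real buffers are far larger, so the gap grows accordingly.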
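As a rough sketch of what a fair C++ stream setup might look like: the standard call is std::ios_base::sync_with_stdio(false), made once before any output, followed by a single bulk write per string. The helper names fast_io_setup and write_str_stream are hypothetical and exist only for this illustration.

```cpp
#include <iostream>
#include <string>

// Hypothetical setup helper: call once, before any input or output happens.
inline void fast_io_setup() {
    // Decouple the C++ streams from C stdio so they may buffer on their own.
    std::ios_base::sync_with_stdio(false);
}

// Hypothetical helper: hands the whole string to the stream in one bulk call
// instead of one call per character. Returns true if the stream is still good.
inline bool write_str_stream(std::ostream& out, const std::string& s) {
    out.write(s.data(), static_cast<std::streamsize>(s.size()));
    return out.good();
}
```

After fast_io_setup(), mixing std::cout with printf or putchar_unlocked on the same output is no longer guaranteed to interleave in order, so a benchmark should stick to one family per run.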

Answer 2

Choosing the fastest way to output strings depends on the platform, operating system, compiler settings and runtime library in use, but there are some generalizations which may help in understanding what to select.

First, consider that the operating system may have a means of displaying whole strings rather than one character at a time. If so, looping through a system call for character output one at a time would naturally incur overhead for every call to the system, as opposed to the overhead of one system call processing a whole character array.

That's basically what you're encountering: the overhead of a system call.

The performance enhancement of putchar_unlocked, compared to putchar, may be considerable, but only between those two functions. Further, putchar_unlocked is not part of standard C; it is a POSIX function, so it is available on Linux and Mac OS X but not under that name in the Windows runtime library.

That said, locked or unlocked, there would still be overhead for each character that may be eliminated by a system call processing the entire character array, and such notions extend to output to files or other devices, not just the console.
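A minimal sketch of the two approaches side by side, assuming a POSIX system (putc_unlocked is POSIX, not standard C or C++). The names write_str_bulk and write_str_chars are hypothetical; fwrite is the standard C library call that accepts a whole array at once.

```cpp
#include <stdio.h>

// Hypothetical bulk variant: passes the whole array to the C library in one
// call. Returns the number of bytes the library accepted.
inline size_t write_str_bulk(const char* s, size_t n, FILE* out = stdout) {
    return fwrite(s, 1, n, out);
}

// Per-character variant for comparison (POSIX only): one library call per
// byte. Returns the number of bytes written before any error.
inline size_t write_str_chars(const char* s, size_t n, FILE* out = stdout) {
    size_t written = 0;
    for (; written < n; ++written) {
        if (putc_unlocked(static_cast<unsigned char>(s[written]), out) == EOF)
            break;
    }
    return written;
}
```

Both functions produce the same bytes; the difference is only in how many calls it takes to hand them over, which is the per-character overhead described above.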

Answer 3

My personal guess is that printf() does it in chunks, and only has to cross the app/kernel boundary once per chunk.

putchar_unlocked() does it for every byte written.


solution1
3 2015-09-19 20:39:17

solution2
2 ACCEPTED 2015-09-19 20:31:51

solution3
1 2015-09-20 02:10:59
