Why is multithreaded slower?

Question

So I am trying to write a program that finds prime numbers. The real purpose of the project is just to learn about multithreading. First I wrote a single-threaded program, and it finds the primes up to 13,633,943 in 1 minute. My multithreaded version only got to 10,025,627 in the same time.

Here is my code for the single threaded program

#include <iostream>

using namespace std;

bool isprime(long num)
{
    long lim = num/2;
    if(num == 1)
    {
        return 0;
    }
    for(long i = 2; i <= lim; i++)
    {
        if (num % i == 0)
        {
            return 0;
        }
        else{ lim = num/i; }
    }
    return 1;
}

int main()
{
    long lim;
    cout << "How many numbers should I test: ";
    cin >> lim;
    for(long i = 1; i <= lim || lim == 0; i++)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
}

Here is my code for my multithreaded program.

extern"C"
{
    #include <pthread.h>
    #include <unistd.h>
}
#include <iostream>

using namespace std;

bool isprime(long num);
void * iter1(void * arg);
void * iter2(void * arg);
void * iter3(void * arg);
void * iter4(void * arg);


int main()
{
    //long lim;
    //cout << "How many numbers should I test: ";
    //cin >> lim;
    pthread_t t1;
    char mem1[4096];//To avoid false sharing. Needed anywhere else?
    pthread_t t2;
    char mem2[4096];//These helped but did not solve problem.
    pthread_t t3;
    pthread_create(&t1, NULL, iter1, NULL);
    pthread_create(&t2, NULL, iter2, NULL);
    pthread_create(&t3, NULL, iter3, NULL);
    iter4(0);
}

bool isprime(long num)
{
    long lim = num/2;
    if(num == 1)
    {
        return 0;
    }
    for(long i = 2; i <= lim; i++)
    {
        if (num % i == 0)
        {
            return 0;
        }
        else{ lim = num/i; }
    }
    return 1;
}

void * iter1(void * arg)
{
    for(long i = 1;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

void * iter2(void * arg)
{
    for(long i = 2;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

void * iter3(void * arg)
{
    for(long i = 3;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

void * iter4(void * arg)
{
    for(long i = 4;; i = i + 4)
    {
        if(isprime(i))
        {
            cout << i << endl;
        }
    }
return 0;
}

Something that especially confuses me is that the system monitor reports 25% CPU usage for the single-threaded version and 100% usage for the multithreaded one. Shouldn't that mean it is doing 4 times as many calculations?

Answer 1

I'm fairly sure cout acts as a shared resource - and even if it actually prints each number correctly and in the right order, doing so slows things down a great deal.

I have done something similar (it is more flexible, and uses an atomic operation to "pick the next number"), and it's almost exactly 4x faster on my quad core machine. But that's only if I don't print anything. If it prints to the console, it's a lot slower - because a lot of the time is used shuffling pixels rather than actually calculating.

Comment out the cout << i << endl; line, and it will run much quicker.

Edit: using my test program, with printing:

Single thread: 15.04s. 
Four threads: 11.25s

Without printing:

Single thread: 12.63s.
Four threads: 3.69s.

3.69 * 4 = 14.76s, but the time command on my Linux machine shows 12.792s total runtime, so there is obviously a little bit of time when all threads aren't running - or some accounting errors...
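
The test program mentioned above was not posted, so the following is only a minimal sketch of what "an atomic operation to pick the next number" could look like. The names is_prime_sketch and count_primes_atomic are hypothetical, and the workers count primes instead of printing them so that console output does not dominate the timing.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Trial division up to sqrt(n); a stand-in for the isprime in the question.
bool is_prime_sketch(unsigned long n) {
    if (n < 2) return false;
    if (n % 2 == 0) return n == 2;
    for (unsigned long d = 3; d * d <= n; d += 2)
        if (n % d == 0) return false;
    return true;
}

// Counts the primes in [1, limit] using nthreads workers. Each worker claims
// the next untested number with an atomic fetch_add, so no number is tested
// twice and no thread sits idle while work remains.
unsigned long count_primes_atomic(unsigned long limit, unsigned nthreads) {
    std::atomic<unsigned long> next{1};
    std::vector<unsigned long> counts(nthreads, 0);  // one slot per worker
    std::vector<std::thread> workers;

    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&next, &counts, limit, t] {
            unsigned long local = 0;  // accumulate locally, publish once
            for (;;) {
                unsigned long n = next.fetch_add(1, std::memory_order_relaxed);
                if (n > limit) break;
                if (is_prime_sketch(n)) ++local;
            }
            counts[t] = local;
        });
    }

    unsigned long total = 0;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers[t].join();  // join before reading this worker's slot
        total += counts[t];
    }
    return total;
}
```

Calling count_primes_atomic(limit, 4) and timing it against count_primes_atomic(limit, 1) should show how well the computation alone scales on a given machine.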

Answer 2

I think a lot of your current problem is that you're taking the part that can really operate multi-threaded (finding the primes) and burying it in noise (the time to write the output to the console).

To get an idea of how much effect this has, I rewrote your main a little bit to separate printing the primes from finding the primes. To make timing easier, I also had it take the limit from the command line instead of interactively, giving this:

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "Usage: bad_prime <limit:long>\n";
        return 1;
    }
    std::vector<unsigned long> primes;

    unsigned long lim = atol(argv[1]);

    clock_t start = clock();

    for(unsigned long i = 1; i <= lim; i++)
        if(isprime(i))
            primes.push_back(i);
    clock_t stop = clock();

    for (auto a : primes)
        std::cout << a << "\t";

    std::cerr << "\nTime to find primes: " << double(stop-start)/CLOCKS_PER_SEC << "\n";
}

Skipping the thousands of lines of the primes themselves, I get a result like this:

Time to find primes: 0.588


Real    48.206
User    1.68481
Sys     3.40082

So -- roughly half a second to find the primes, and over 47 seconds to print them. Assuming the intent really is to write the output to the console, we might as well stop right there. Even if multithreading could completely eliminate the time to find the primes, we'd still only change the ultimate time from ~48.2 seconds to ~47.6 seconds -- unlikely to be worthwhile.
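
One caveat when carrying this measurement over to a multithreaded version: clock() reports processor time rather than elapsed time, and on Linux (and, as far as I know, most POSIX systems) it is summed over all threads of the process. That is also why a monitor can show 100% CPU usage without the program finishing four times sooner. Here is a small sketch that records both figures; Timing and time_call are hypothetical names, not part of the code above.

```cpp
#include <chrono>
#include <ctime>

struct Timing {
    double wall_seconds;  // elapsed real time
    double cpu_seconds;   // processor time as reported by clock()
};

// Runs work() once and reports both elapsed time and processor time.
template <typename F>
Timing time_call(F&& work) {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();
    work();
    std::clock_t cpu_stop = std::clock();
    auto wall_stop = std::chrono::steady_clock::now();
    return { std::chrono::duration<double>(wall_stop - wall_start).count(),
             double(cpu_stop - cpu_start) / CLOCKS_PER_SEC };
}
```

For a single-threaded, CPU-bound loop the two numbers should be close; with several busy threads the processor time can exceed the elapsed time.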

For the moment, therefore, I'll assume the real intent is to write the output to something like a file. Since it seems pretty pointless to go to the work of making code multi-threaded, but run horribly inefficient code in each thread, I thought I'd optimize (or, at least, de-pessimize) the single-threaded code as a starting point.

First, I removed the endl and replaced it with "\n". With the output directed to a file, this reduced the run-time from 0.968 seconds to 0.678 seconds. endl flushes the buffer in addition to writing a newline, and that buffer flushing accounted for roughly one third of the time taken by the program overall.

On the same basis, I took the liberty of rewriting your isprime to something that's at least a little less inefficient:

bool isprime(unsigned long num) {
    if (num == 2)
        return true;

    if(num == 1 || num % 2 == 0)
        return false;

    unsigned long lim = sqrt(num);

    for(unsigned long i = 3; i <= lim; i+=2)
        if (num % i == 0)
            return false;

    return true;
}

This is certainly open to more improvement (e.g., a sieve of Eratosthenes), but it's simple, straightforward, and around two to three times as fast (the times above are based on using this isprime, not yours).
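
For reference, a sieve of Eratosthenes might look roughly like the sketch below. It is not the code used for any of the timings in this answer, and sieve_primes is a hypothetical name. It trades memory (one flag per number up to the limit) for avoiding trial division entirely.

```cpp
#include <vector>

// Returns all primes in [2, limit] using a sieve of Eratosthenes.
std::vector<unsigned long> sieve_primes(unsigned long limit) {
    std::vector<unsigned long> primes;
    if (limit < 2) return primes;

    std::vector<bool> composite(limit + 1, false);
    for (unsigned long n = 2; n <= limit; ++n) {
        if (composite[n]) continue;  // already crossed out by a smaller prime
        primes.push_back(n);
        // Start crossing out at n*n: smaller multiples of n were already
        // marked by smaller primes. The n <= limit / n check skips the loop
        // when n*n would exceed limit (and so also avoids overflow).
        if (n <= limit / n)
            for (unsigned long m = n * n; m <= limit; m += n)
                composite[m] = true;
    }
    return primes;
}
```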

At this point, multithreading the prime finding at least stands a chance of making some sense: with the prime finding taking roughly .5 out of .6 seconds, even if we can only double the speed, we should see a significant difference in overall time.

Separating the output from the prime finding also gives us a much better basis for writing a multi-threaded version of the code. With each thread writing its results to a separate vector, we can get meaningful (not interleaved) output without having to do locking on cout and such -- we compute each chunk separately, then print out each vector in order.

Code for that could look something like this:

#include <iostream>
#include <vector>
#include <time.h>
#include <math.h>
#include <stdlib.h>  // atol
#include <thread>

using namespace std;

bool isprime(unsigned long num) {
    // same as above
}

typedef unsigned long UL;

struct params { 
    unsigned long lower_lim;
    unsigned long upper_lim;
    std::vector<unsigned long> results;

    params(UL l, UL u) : lower_lim(l), upper_lim(u) {}
};

long thread_func(params *p) { 
    for (unsigned long i=p->lower_lim; i<p->upper_lim; i++)
        if (isprime(i))
            p->results.push_back(i);
    return 0;
}

int main(int argc, char **argv) {
    if (argc != 2) {
        std::cerr << "Usage: bad_prime <limit:long>\n";
        return 1;
    }

    unsigned long lim = atol(argv[1]);

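    // Each range is [lower_lim, upper_lim): the upper bound is exclusive,
    // so the last range ends at lim + 1 to include lim itself.
    params p[] = {
        params(1, lim/4),
        params(lim/4, lim/2),
        params(lim/2, 3*lim/4),
        params(3*lim/4, lim + 1)
    };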

    std::thread threads[] = {
        std::thread(thread_func, p), 
        std::thread(thread_func, p+1),
        std::thread(thread_func, p+2),
        std::thread(thread_func, p+3)
    };

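    // Join the threads in order and print each chunk as soon as it is done,
    // so the output stays sorted without any locking on cout.
    for (int i=0; i<4; i++) {
        threads[i].join();
        for (UL prime : p[i].results)
            std::cout << prime << "\n";
    }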
}

Running this on the same machine as before (a fairly old dual-core processor), I get:

Real    0.35
User    0.639604
Sys     0

This seems to be scaling extremely well. If all we gained came from multi-core computation, we'd expect the time to find the primes to halve (I'm running this on a dual-core processor) and the time to write the data to disk to remain constant (multithreading isn't going to speed up my hard drive). Based on that, perfect scaling should give us 0.59/2 + 0.1 = 0.395, or roughly 0.40 seconds.

The (admittedly) minor improvement we're seeing beyond that is most likely stemming from the fact that we can start writing the data from thread 1 to the disk while threads 2, 3 and 4 are still finding primes (and likewise, start writing the data from thread 2 while 3 and 4 are still computing, and writing data from thread 3 while thread 4 is still computing).

I suppose I should add that the improvement we're seeing is small enough that it could also be simple noise in the timing. I did, however, run both the single- and multi-threaded versions a number of times, and while there's some variation in both, the multi-threaded version is consistently faster than just the improvement in computation speed should account for.

I almost forgot: to get an idea of how much difference this makes in overall speed, I ran a test to see how long it would take to find the primes up to 13,633,943, which your original version found in one minute. Even though I'm almost certainly using a slower CPU (a ~7 year-old Athlon 64 X2 5200+) this version of the code does that in 12.7 seconds.

One final note: at least for the moment, I've left out the padding you'd inserted to prevent false sharing. Based on the times I'm getting, it doesn't seem to be necessary (or useful).
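
As far as I can tell from the posted code, the char arrays in the question sit between pthread_t handles that the worker threads never write to, so they would not be expected to affect false sharing in the first place. If each thread did update its own frequently written variable, the usual approach is to align that data to the cache line size. A minimal sketch, where the 64-byte line size and the names kAssumedCacheLine and PaddedCounter are assumptions for illustration:

```cpp
#include <cstddef>

// Assumed cache line size; 64 bytes is common but it varies by CPU.
constexpr std::size_t kAssumedCacheLine = 64;

// alignas rounds both the alignment and the size of the struct up to one
// full cache line, so adjacent array elements never share a line.
struct alignas(kAssumedCacheLine) PaddedCounter {
    unsigned long value = 0;
};

// One counter per thread; thread t would only ever touch per_thread_counts[t].
PaddedCounter per_thread_counts[4];
```

In the multi-threaded code above this is not needed, since each thread mostly appends to its own vector, whose storage lives on the heap.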

Answer 3

This depends rather on how many CPUs the OS gives your code to run on. Each of these threads is CPU bound, so if you have just the one CPU it's going to run one thread for a bit, timeslice it, run the next thread, and so on. That won't be any faster and may well be slower, depending on the overhead of a thread swap. And on Solaris, at least, it's worthwhile telling it you want all the threads to run at once.
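
If you can use C++11, one way to ask how many threads the hardware can run at once is std::thread::hardware_concurrency(). A minimal sketch follows; pick_thread_count is a hypothetical helper name. The value is only a hint, it may be 0 when it cannot be determined, and it typically counts logical processors, so hyperthreads are included.

```cpp
#include <thread>

// Returns the number of hardware threads reported by the implementation,
// or fallback when the value is unknown (reported as 0).
unsigned pick_thread_count(unsigned fallback = 1) {
    unsigned n = std::thread::hardware_concurrency();
    return n != 0 ? n : fallback;
}
```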

I've not come across an implementation where output is serialised the way the other poster suggests. Normally you get output like

235 iisi s  ppprririimmme
ee

so if your output is not interleaved like this, it may well indicate the O/S is not actually running your threads in parallel.

Another issue you might be hitting is that output to a console is incredibly slow compared to output to a file. It may be worth sending the output from your program to a file and seeing how fast it goes like that.

Answer 4

I believe Oli Charlesworth hit it on the head with the hyperthreading problem. I thought hyperthreading was like actually having two cores. It's not. I changed it to use only two threads and I got up to 22,227,421 which is pretty close to twice as fast.

Answer 5

While @MatsPetersson is correct (at least for a POSIX based system, stdout is a shared resource), he doesn't provide a way to fix that problem, so here's how you can eliminate those pesky locks.

POSIX C defines a function, putc_unlocked, which will do exactly the same thing as putc, but without locking (surprise). Using that, then, we can define our own function which will print an integer without locking, and be faster than cout or printf in multithreaded scenarios:

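#include <math.h>   // log10
#include <stdio.h>  // FILE, putc_unlocked

void printint_unlocked(FILE *fptr, int i) {
    // Powers of ten, indexed by digit position
    static int digits[] = {
        1,
        10,
        100,
        1000,
        10000,
        100000,
        1000000,
        10000000,
        100000000,
        1000000000,
    };

    if (i < 0) {
        putc_unlocked('-', fptr);
        i = -i; // note: overflows for INT_MIN
    }

    // log10(0) is not finite, so treat 0 as a one-digit number
    int ndigits = (i == 0) ? 0 : (int) log10(i);
    while (ndigits >= 0) {
        int digit = (i / (digits[ndigits])) % 10;

        putc_unlocked('0' + digit, fptr);

        --ndigits;
    }
}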

Note that it is entirely possible for there to be race conditions with this method, causing numbers to collide in your output. If your algorithm doesn't end up with any collisions, you should still get the performance boost of multithreaded code.

The third and final option (and probably one too complex for your use case) is to create an event queue on yet another thread, and do all printing from that thread, resulting in no race conditions, and no locking issues between threads.

5 answers

solution1
12 2013-06-06 14:27:31

solution2
7 2013-06-06 17:19:43

solution3
1 2013-06-06 14:48:14

solution4
1 2013-06-06 14:49:56

solution5
-2 2013-06-06 14:50:03
