Simple multi-threading confusion for C++

I am developing a C++ application in Qt. I have a very basic question, please forgive me if it is too stupid...

How many threads should I create to divide a task amongst them for minimum time?

I am asking this because my laptop has a 3rd gen i5 processor (3210M). It is dual core, but the NO_OF_PROCESSORS environment variable is showing me 4. I had read in an article that dynamic memory for an application is only available to the processor which launched that application. So should I create only 1 thread (since the env variable says 4 processors), 2 threads (since my processor is dual core and the env variable might be suggesting the number of cores), or 4 threads (if that article was wrong)? Please forgive me since I am a beginner level programmer trying to learn Qt. Thank you :)

Although hyperthreading is somewhat of a lie (you're told that you have 4 cores, but you really only have 2 cores, plus another two logical ones that run on whatever resources the former two don't use, if there's such a thing), the correct thing to do is still to use as many threads as NO_OF_PROCESSORS tells you.

Note that Intel isn't the only one lying to you. It's even worse on recent AMD processors, where you have 8 alleged "real" cores that are in reality only 4 modules, each a pair of cores with resources shared between them.

However, most of the time, it just more or less works out. Even in absence of explicitly blocking a thread (on a wait function or a blocking read), there's always a point where a core is stalled, for example in accessing memory due to a cache miss, which gives away resources that can be used by the hyperthreaded core.

Therefore, if you have a lot of work to do, and you can parallelize it nicely, you should really have as many workers as there are advertised cores (whether they're "real" or "hyper"). This way, you make maximum use of the available processor resources.

Ideally, one would create worker threads early at application startup, and have a task queue to hand tasks to workers. Since synchronization overhead is often non-negligible, the tasks in the queue should be rather "coarse". There is a tradeoff between maximum core usage and synchronization overhead.

For example, if you have 10 million elements in an array to process, you might push tasks that refer to 100,000 or 200,000 consecutive elements (you will not want to push 10 million tasks!). That way, you make sure that no cores stay idle on average (if one finishes earlier, it pulls another task instead of doing nothing) and you only have a hundred or so synchronizations, the overhead of which is more or less negligible.
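The chunking idea can be sketched roughly as follows. This is only an illustration under some assumptions: the helper name process_in_chunks is made up, the "work" is simply doubling each element, and a shared atomic counter stands in for a real task queue. It also creates its threads on each call for brevity, whereas a long-lived pool created at startup would be preferable in a real application.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: doubles every element of `data` using `workers` threads.
// Each worker repeatedly claims the next block of `chunk` consecutive elements.
void process_in_chunks(std::vector<int>& data, std::size_t chunk, unsigned workers) {
    assert(chunk > 0 && workers > 0);
    std::atomic<std::size_t> next{0};  // index of the first unclaimed element

    auto worker = [&] {
        for (;;) {
            // One synchronization per chunk, not one per element.
            const std::size_t begin = next.fetch_add(chunk);
            if (begin >= data.size()) break;  // nothing left to claim
            const std::size_t end = std::min(begin + chunk, data.size());
            for (std::size_t i = begin; i < end; ++i) data[i] *= 2;
        }
    };

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < workers; ++t) pool.emplace_back(worker);
    for (auto& th : pool) th.join();  // wait until every chunk is processed
}
```

Calling process_in_chunks(data, 100000, 4) on a vector of 10 million elements would hand out 100 chunks in total, and a worker that finishes early simply claims the next one.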

If tasks involve file/socket reads or other things that can block for indefinite time, spawning another 1-2 threads is often no mistake (takes a bit of experimentation).

This totally depends on your workload. If you have a workload which is very CPU intensive, you should stay close to the number of threads your CPU has (4 in your case: 2 cores * 2 for hyperthreading). A small oversubscription might also be OK, as that can compensate for times where one of your threads waits for a lock or something else.
On the other hand, if your application is not CPU bound and is mostly waiting, you can even create more threads than your CPU count. You should however note that thread creation can be quite an overhead. The only solution is to measure where your bottleneck is and optimize in that direction.

Also note that if you are using C++11, std::thread::hardware_concurrency gives you a portable way to query the number of hardware threads (logical cores) you have. The value is only a hint and may be 0 if it cannot be determined.
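A minimal sketch of how that query might be used is shown below. The helper name pick_worker_count and its fallback parameter are made up for illustration. Qt offers QThread::idealThreadCount() for the same purpose.

```cpp
#include <cassert>
#include <thread>

// Hypothetical helper: choose a worker count from the hardware hint,
// using `fallback` when the implementation cannot determine a value.
unsigned pick_worker_count(unsigned fallback = 2) {
    assert(fallback > 0);
    const unsigned hint = std::thread::hardware_concurrency();  // 0 means "unknown"
    return hint != 0 ? hint : fallback;
}
```

On the hyperthreaded dual-core machine from the question, the hint would be expected to report 4, matching the environment variable.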

Concerning your question about dynamic memory, you must have misunderstood something there. Generally, all threads you create can access the memory you allocated in your application. In addition, this has nothing to do with C++ and is outside the scope of the C++ standard.
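To illustrate that point, here is a small sketch in which memory allocated by one thread is written by several others. The function name fill_from_threads is hypothetical, and C++11 std::thread is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical demo: the calling thread allocates a buffer on the heap,
// then `count` other threads each write into one element of it.
std::vector<int> fill_from_threads(std::size_t count) {
    assert(count > 0);
    std::vector<int> buffer(count, 0);  // allocated by the calling thread
    std::vector<std::thread> threads;
    for (std::size_t i = 0; i < count; ++i) {
        // Every thread touches a different element, so no locking is needed.
        threads.emplace_back([&buffer, i] { buffer[i] = static_cast<int>(i) + 1; });
    }
    for (auto& th : threads) th.join();  // after join() the writes are visible here
    return buffer;
}
```

All threads see the same buffer regardless of which core they happen to run on.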

NO_OF_PROCESSORS shows 4 because your CPU has Hyper-threading. Hyper-threading is the Intel trademark for tech that enables a single core to execute 2 threads (of the same application or of different ones) more or less at the same time. It works as long as, e.g., one thread is fetching data while the other one is using the ALU. If both need the same resource and instructions can't be reordered, one thread will stall. This is the reason you see 4 cores, even though you have 2.

That dynamic memory is only available to one of the cores is IMO not quite right, but register contents and sometimes cache contents are. Everything that resides in RAM should be available to all CPUs.

More threads than CPUs can help, depending on how your operating system's scheduler works, how you access data, etc. To find out, you'll have to benchmark your code. Everything else will just be guesswork.

Apart from that, if you're trying to learn Qt, this is maybe not the right thing to worry about...


Answering your question: We can't really tell you how much slower or faster your program will run if you increase the number of threads. This depends on what you are doing. If you are, e.g., waiting for responses from the network, you could increase the number of threads much more. If your threads are all using the same hardware, 4 threads might not perform better than 1. The best way is to simply benchmark your code.
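A benchmark of this kind could look roughly like the sketch below. The names time_ms and fastest_thread_count are invented for illustration, and `work` is assumed to be a callable that runs your real workload with the given number of threads. A single run per candidate is noisy, so in practice you would repeat each measurement a few times.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <limits>
#include <vector>

// Times one call of `work` in milliseconds using a monotonic clock.
double time_ms(const std::function<void()>& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Runs `work(n)` once for every candidate thread count and returns the
// candidate whose run took the least time.
unsigned fastest_thread_count(const std::function<void(unsigned)>& work,
                              const std::vector<unsigned>& candidates) {
    assert(!candidates.empty());
    unsigned best = candidates.front();
    double best_ms = std::numeric_limits<double>::infinity();
    for (unsigned n : candidates) {
        const double ms = time_ms([&] { work(n); });
        if (ms < best_ms) {
            best_ms = ms;
            best = n;
        }
    }
    return best;
}
```

For the machine in the question, trying candidates such as 1, 2, 4 and 8 would show whether the hyperthreaded cores actually help for your particular workload.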

In an ideal world, if you are 'just' crunching numbers, it should not make a difference whether you have 4 or 8 threads running: the net time should be the same (neglecting time for context switches etc.), just the response time will differ. The thing is that nothing is ideal. We have caches, and your CPUs all access the same memory over the same bus, so in the end they compete for access to resources. Then you also have an operating system that might or might not schedule a thread/process at a given time.

You also asked for an explanation of synchronization overhead: If all your threads access the same data structures, you will have to do some locking etc. so that no thread accesses the data in an invalid state while it is being updated.

Assume you have two threads, both doing the same thing:

int sum = 0; // global variable, shared by all threads

void thread() {
    int i = sum; // read the shared value
    i += 1;
    sum = i;     // write it back
}

If you start two threads doing this at the same time, you cannot reliably predict the output. It might happen like this:

THREAD A : i = sum; // i = 0
           i += 1;  // i = 1
**context switch**
THREAD B : i = sum; // i = 0
           i += 1;  // i = 1
           sum = i; // sum = 1
**context switch**
THREAD A : sum = i; // sum = 1

In the end sum is 1, not 2, even though you started the thread twice. To avoid this you have to synchronize access to sum, the shared data. Normally you would do this by blocking access to sum for as long as needed. Synchronization overhead is the time that threads spend waiting, doing nothing, until the resource is unlocked again.
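One way to do that blocking, sketched here with a C++11 std::mutex (the function name run_two_incrementers is made up, and sum is a local variable rather than a global to keep the example self-contained):

```cpp
#include <cassert>
#include <mutex>
#include <thread>

// Hypothetical demo: two threads perform the read-modify-write from above,
// but a mutex ensures only one of them is inside it at any time.
int run_two_incrementers() {
    int sum = 0;           // shared data
    std::mutex sum_mutex;  // guards every access to `sum`

    auto increment = [&] {
        std::lock_guard<std::mutex> lock(sum_mutex);  // unlocked at scope exit
        int i = sum;
        i += 1;
        sum = i;
    };

    std::thread a(increment);
    std::thread b(increment);
    a.join();
    b.join();

    assert(sum == 2);  // neither update can be lost while the lock is used
    return sum;
}
```

While one thread holds the lock, the other has to wait. That waiting time is the synchronization overhead described above.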

If you have discrete work packages for each thread and no shared resources you should have no synchronization overhead.
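As a sketch of such discrete work packages, each thread below sums its own contiguous slice into its own result slot, and the slots are combined only after all threads have finished. The name parallel_sum and the slicing scheme are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` with `threads` threads and no locking.
std::uint64_t parallel_sum(const std::vector<std::uint32_t>& data, unsigned threads) {
    assert(threads > 0);
    std::vector<std::uint64_t> partial(threads, 0);  // one private slot per thread
    const std::size_t slice = (data.size() + threads - 1) / threads;  // rounded up

    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&, t] {
            // Clamp both ends so trailing threads may get an empty slice.
            const std::size_t begin = std::min(data.size(), t * slice);
            const std::size_t end = std::min(data.size(), begin + slice);
            partial[t] = std::accumulate(data.begin() + begin, data.begin() + end,
                                         std::uint64_t{0});
        });
    }
    for (auto& th : pool) th.join();

    // Combine the private results once, after all threads are done.
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}
```

The only synchronization left is the join at the end, so the threads never wait on each other while working.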

The easiest way to get started with dividing work among threads in Qt is to use the Qt Concurrent framework. Example: You have some operation that you want to perform on every item in a QList (pretty common).

#include <QtConcurrent>

void operation( ItemType & item )
{
  // do work on item, changing it in place
}

QList<ItemType> seq;  // populate your list

// apply operation to every member of seq
QFuture<void> future = QtConcurrent::map( seq, operation );
// if you want to wait until all operations are complete before you move on...
future.waitForFinished();

Qt handles the threading automatically, so there is no need to worry about it. The QFuture and QFutureWatcher documentation describes how you can handle the map completion asynchronously with signals and slots if you need to do that.

