
If we use memory fences to enforce consistency, how does “thread-thrashing” ever occur?

Before I knew of the CPU's store buffer, I thought thread-thrashing simply occurred when two threads wanted to write to the same cache line: one would prevent the other from writing. However, this seems pretty synchronous. I later learnt that there is a store buffer, which temporarily holds the writes, and that it is forced to flush by the SFENCE instruction - kinda implying there is no synchronous prevention of multiple cores accessing the same cache line...

I am totally confused how thread-thrashing occurs if we have to be careful and use SFENCE. Thread-thrashing implies blocking, whereas SFENCE implies the writes are done asynchronously and the programmer must manually flush the write?

(My understanding of SFENCE may be confused too - because I also read that the Intel memory model is "strong", and therefore memory fences are only required for x86 string instructions.)

Could somebody please remove my confusion?

"Thrashing" meaning multiple cores retrieving the same cpu cacheline and this causing latency overhead for other cores competing for the same cacheline.

So, at least in my vocabulary, thread-thrashing happens when you have something like this:

  // global variables
  int x;
  bool done;

  // Thread 1
  void thread1_code()
  {
    while(!done)
      x++;
  }

  // Thread 2
  void thread2_code()
  {
    while(!done)
      x++;
  }

(This code is of course total nonsense - I'm making it ridiculously simple, and pointless, so that we don't have complicated code where explaining the code itself distracts from what is going on between the threads.)

For simplicity, we'll assume thread 1 always runs on processor 1, and thread 2 always runs on processor 2. [1]

If you run these two threads on an SMP system - and we've JUST started this code [both threads start, by magic, at almost exactly the same time, not, as in a real system, many thousand clock-cycles apart] - thread 1 will read the value of x, update it, and write it back. By now, thread 2 is also running, and it will also read the value of x, update it, and write it back. To do that, its processor needs to actually ask the other processor(s) "do you have a (new value for) x in your cache? If so, can you please give me a copy?". And of course, processor 1 will have a new value, because it has just stored back the value of x. Now, that cache line is "shared" (both processors have a copy of the value). Thread 2 updates the value and writes it back to memory. When it does so, another signal is sent from its processor saying "If anyone is holding a value of x, please get rid of it, because I've just updated the value".

Of course, it's entirely possible that BOTH threads read the same value of x, update it to the same new value, and write it back as the same new modified value. And sooner or later one processor will write back a value that is lower than the value written by the other processor, because it has fallen behind a bit...
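
If you want to see those lost updates for yourself, here is a minimal runnable sketch (my own, not part of the original example) using C++11 threads. Note that this is formally a data race, i.e. undefined behaviour - which is exactly the bug being illustrated - so the printed result is typically well below the 2000000 you might naively expect:

  #include <cstdio>
  #include <thread>

  int x = 0;   // deliberately NOT atomic

  void hammer()
  {
      for (int i = 0; i < 1000000; i++)
          x++;   // non-atomic read-modify-write: increments get lost
  }

  int main()
  {
      std::thread t1(hammer);
      std::thread t2(hammer);
      t1.join();
      t2.join();
      std::printf("x = %d\n", x);   // usually noticeably less than 2000000
      return 0;
  }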

A fence operation will help ensure that the data written to memory has actually got all the way to cache before the next operation happens, because, as you say, there are write buffers to hold memory updates before they actually reach memory. If you don't have a fence instruction, your processors will probably get seriously out of phase, and update the value more than once before the other has had time to say "do you have a new value for x?" - however, it doesn't really help prevent processor 1 asking for the data from processor 2 and processor 2 immediately asking for it "back", thus ping-ponging the cache content back and forth as quickly as the system can achieve.
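
To make that concrete: on x86 you can issue an SFENCE through the _mm_sfence() intrinsic. A minimal sketch (my own illustration, with made-up variable names) of what it actually guarantees - store ordering, not mutual exclusion:

  #include <immintrin.h>   // _mm_sfence

  int payload;
  int flag;

  void producer()
  {
      payload = 42;    // store 1
      _mm_sfence();    // store 1 becomes globally visible before store 2
      flag = 1;        // store 2
  }

Note that on x86's strong memory model ordinary stores are already ordered, so SFENCE mostly matters for weakly-ordered (non-temporal/streaming) stores - which is the grain of truth in the "only required for string instructions" statement in the question. In portable C++ you would reach for std::atomic or std::atomic_thread_fence instead.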

To ensure that ONLY ONE processor updates some shared value, it is required that you use a so-called atomic instruction. These special instructions are designed to operate in conjunction with write buffers and caches, such that they ensure that ONLY one processor actually holds an up-to-date value for the cache line that is being updated, and NO OTHER processor is able to update the value until this processor has completed the update. So you never get "read the same value of x and write back the same value of x" or any similar thing.
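
In C++, the portable way to get such an atomic instruction is std::atomic - a minimal sketch (my own, reusing the names from the example above). On x86 the fetch_add below typically compiles down to a LOCK XADD, which keeps the cache line exclusively owned for the whole read-modify-write:

  #include <atomic>

  std::atomic<int>  x{0};
  std::atomic<bool> done{false};

  void thread_code()
  {
      while (!done)
          x.fetch_add(1, std::memory_order_relaxed);   // atomic increment
  }

(memory_order_relaxed is enough here because we only need the increment itself to be atomic, not any ordering with surrounding operations.)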

Since caches don't work on single bytes or single integer-sized things, you can also have "false sharing". For example:

 int x, y;
 bool done;

 void thread1_code()
 {
    while(!done) x++;
 }

 void thread2_code()
 {
    while(!done) y++;
 }

Now, x and y are not actually THE same variable, but they are (quite plausibly, but we can't know for 100% sure) located within the same cache line of 16, 32, 64 or 128 bytes (depending on processor architecture). So although x and y are distinct, when one processor says "I've just updated x, please get rid of any copies", the other processor will get rid of its (still correct) value of y at the same time as getting rid of x. I had such an example where some code was doing:

 struct {
    int x[num_threads];
    ... lots more stuff in the same way
 } global_var;

 void thread_code()
 {
    ...
     global_var.x[my_thread_number]++;
    ...
 }

Of course, two threads would then update values right next to each other, and the performance was RUBBISH - about 6x slower than when we fixed it by doing:

struct
{
   int x;
   ... more stuff here ... 
} global_var[num_threads]; 

 void thread_code()
 {
    ...
     global_var[my_thread_number].x++;
    ...
 }
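
In C++11 and later you can state the "one cache line per thread" intent explicitly with alignas. A sketch (mine; the 64 is an assumed cache-line size - std::hardware_destructive_interference_size is the portable spelling where your standard library provides it - and I'm assuming num_threads is a compile-time constant, as the original array declaration already requires):

  constexpr int num_threads = 4;      // assumed compile-time constant

  struct alignas(64) per_thread_data  // 64 = assumed cache-line size
  {
      int x;
      // ... more per-thread stuff here ...
  };

  per_thread_data global_var[num_threads];

  void thread_code(int my_thread_number)
  {
      global_var[my_thread_number].x++;   // each thread owns its own line
  }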

Edit to clarify: a fence does not (as my recent edit explains) "help" against ping-ponging the cache content between threads. It also doesn't, in and of itself, prevent data from being updated out of sync between the processors - it does, however, ensure that the processor performing the fence operation doesn't continue doing OTHER memory operations until this particular operation's memory content has got "out of" the processor core itself. Since there are various pipeline stages, and most modern CPUs have multiple execution units, one unit may well be "ahead" of another that is technically "behind" in the execution stream. A fence will ensure that "everything has been done here". It's a bit like the man with the big stop-board in Formula 1 racing, who ensures that the driver doesn't drive off from the tyre change until ALL the new tyres are securely on the car (if everyone does what they should).

The MESI or MOESI protocol is a state-machine system that ensures that operations between different processors are done correctly. A processor can have a Modified value (in which case a signal is sent to all other processors to "stop using the old value"), a processor may "Own" the value (it is the holder of this data, and may modify the value), a processor may have an "Exclusive" value (it's the ONLY holder of the value; everyone else has got rid of their copy), the value may be "Shared" (more than one processor has a copy, but this processor should not update the value - it is not the "owner" of the data), or Invalid (the data is not present in the cache). MESI doesn't have the "Owned" state, which means a little more traffic on the snoop bus ("snoop" meaning "Do you have a copy of x?", "please get rid of your copy of x", etc.).
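
To make the state machine a bit more concrete, here is a toy sketch (my own simplification - real hardware is, of course, not written like this) of how OUR copy of one cache line reacts when it sees snoop traffic from another processor:

  enum class State { Modified, Owned, Exclusive, Shared, Invalid };  // MOESI

  // Another processor asks: "do you have a copy of x?" (a read snoop)
  State on_remote_read(State s)
  {
      switch (s) {
      case State::Modified:  return State::Owned;   // MOESI: keep dirty data, share it
      case State::Exclusive: return State::Shared;  // no longer the only holder
      default:               return s;              // Owned/Shared/Invalid unchanged
      }
  }

  // Another processor says: "please get rid of your copy of x" (a write snoop)
  State on_remote_write(State)
  {
      return State::Invalid;   // whatever we had, it's stale now
  }

(In plain MESI there is no Owned state, so a Modified line that another processor reads has to be written back to memory and dropped to Shared - that is the extra snoop-bus traffic mentioned above.)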

[1] Yes, processor numbers usually start with zero, but I can't be bothered to go back and rename thread1 to thread0 and thread2 to thread1 now that I've written this additional paragraph.
