Does the use of an anonymous pipe introduce a memory barrier for interthread communication?

For example, say I allocate a struct with new and write the pointer into the write end of an anonymous pipe.

If I read the pointer from the corresponding read end, am I guaranteed to see the 'correct' contents on the struct?

Also of interest is whether the results of socketpair() on Unix and a self-connection over TCP loopback on Windows have the same guarantees.

The context is a server design which centralizes event dispatch with select/epoll.

No. There is no guarantee that the writing CPU will have flushed the write out of its cache and made it visible to the other CPU that might do the read.

Also of interest is whether the results of socketpair() on Unix and a self-connection over TCP loopback on Windows have the same guarantees.


In practice, calling write(), which is a system call, will end up locking one or more data structures in the kernel, which should take care of the reordering issue. For example, POSIX requires subsequent reads to see data written before their call, which by itself implies a lock (or some kind of acquire/release).

As for whether that's part of the formal spec of the calls, probably it's not.

A pointer is just a memory address, so provided you are in the same process, the pointer will be valid on the receiving thread and will point to the same struct. If you are in different processes, at best you will immediately get a memory error; at worst you will read (or write) some random memory, which is essentially Undefined Behaviour.

Will you read the correct content? Neither better nor worse than if your pointer was in a static variable shared by both threads: you still have to do some synchronization if you want consistency.

Will it matter how the address is transferred: static memory (shared by threads), anonymous pipes, socket pairs, TCP loopback, etc.? No: all those channels transfer bytes, so if you pass a memory address, you will get your memory address. What is left to you then is synchronization, because here you are just sharing a memory address.

If you do not use any other synchronization, anything can happen (did I already mention Undefined Behaviour?):

  • the reading thread can access memory before the writing thread has written it, getting stale data
  • if nothing forces a reload, the reading thread can keep using cached values, again getting stale data (declaring the struct members volatile is not enough: volatile is not a synchronization mechanism in C or C++)
  • the reading thread can read partially written data, meaning incoherent data
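To illustrate what "some synchronization" can look like, here is a minimal sketch that hands a pointer from one thread to another through a mutex-protected queue instead of a bare shared variable. The names Msg, PtrQueue and handoff_demo are hypothetical and chosen only for this example; it is one possible approach, not the only correct one.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>

struct Msg { int id; long value; };   // hypothetical payload

// Hypothetical helper: a tiny blocking queue of pointers.
class PtrQueue {
public:
    void push(Msg* m) {
        {
            std::lock_guard<std::mutex> lock(mu_);
            q_.push(m);
        }   // unlocking here publishes every write made before push()
        cv_.notify_one();
    }
    Msg* pop() {
        std::unique_lock<std::mutex> lock(mu_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        Msg* m = q_.front();
        q_.pop();
        return m;   // locking the same mutex makes the producer's writes visible
    }
private:
    std::mutex mu_;
    std::condition_variable cv_;
    std::queue<Msg*> q_;
};

// The producer fills the struct and then hands the pointer over.
// Returns the value the consumer thread observed.
long handoff_demo() {
    PtrQueue q;
    long seen = 0;
    std::thread consumer([&] {
        Msg* m = q.pop();
        seen = m->value;
        delete m;
    });
    q.push(new Msg{1, 42});
    consumer.join();   // join() makes `seen` safe to read here
    return seen;
}
```

Because both sides lock the same mutex, the struct contents written before push() are guaranteed to be visible after pop(), which is exactly the guarantee a bare pointer transfer lacks.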

Interesting question with, so far, only one correct answer from Cornstalks.

Within the same (multi-threaded) process there are no guarantees since pointer and data follow different paths to reach their destination. Implicit acquire/release guarantees do not apply since the struct data cannot piggyback on the pointer through the cache and formally you are dealing with a data race.

However, looking at how the pointer and the struct data itself reach the second thread (through the pipe and the memory cache respectively), there is a real chance that this mechanism is not going to cause any harm. Sending the pointer to a peer thread takes 3 system calls (write() in the sending thread, select() and read() in the receiving thread), which is (relatively) expensive, and by the time the pointer value is available in the receiving thread, the struct data has probably arrived long before.

Note that this is just an observation, the mechanism is still incorrect.

I believe your case can be reduced to this two-thread model:

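#include <atomic>
#include <cassert>

int data = 0;
std::atomic<int*> atomicPtr{nullptr};

void thread1()
{
    data = 42;
    atomicPtr.store(&data, std::memory_order_release);   // publish the address of data
}

void thread2()
{
    int* ptr = nullptr;
    while (!ptr)    // spin until thread1 has published the pointer
        ptr = atomicPtr.load(std::memory_order_consume);
    assert(*ptr == 42);
}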

Since you have 2 processes you can't use one atomic variable across them, but since you listed Windows you can omit atomicPtr.load(std::memory_order_consume) from the consuming part because, AFAIK, all the architectures Windows runs on guarantee this load to be correct without any barrier on the loading side. In fact, I think there are not many architectures out there where that instruction would not be a no-op (I have only heard of DEC Alpha).

I agree with Serge Ballesta's answer. Within the same process, it's feasible to send and receive an object address via an anonymous pipe.

The write system call is guaranteed to be atomic when the message size is at most PIPE_BUF (4096 bytes on Linux; POSIX requires at least 512), so multiple producer threads will not interleave each other's object addresses (8 bytes for 64-bit applications).
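As a minimal sketch of that claim, the snippet below checks at compile time that a pointer fits within PIPE_BUF and sends one pointer through a pipe and back inside a single process. Payload and roundtrip are hypothetical names used only for illustration, and the example shows only that the address bytes survive the trip; it says nothing about the visibility of the pointed-to data.

```cpp
#include <cassert>
#include <climits>    // PIPE_BUF (may be undefined on some systems)
#include <unistd.h>   // pipe, read, write, close

struct Payload { int id; };   // hypothetical payload type

#ifdef PIPE_BUF
// A pointer-sized write is far below the atomic-write limit.
static_assert(sizeof(Payload*) <= PIPE_BUF, "pointer must fit in one atomic pipe write");
#endif

// Writes the pointer value into a fresh pipe and reads it back.
// Returns the pointer that came out, or nullptr on any failure.
Payload* roundtrip(Payload* p) {
    int fds[2];
    if (pipe(fds) != 0) {
        return nullptr;
    }
    Payload* out = nullptr;
    bool ok = write(fds[1], &p, sizeof p) == (ssize_t) sizeof p
           && read(fds[0], &out, sizeof out) == (ssize_t) sizeof out;
    close(fds[0]);
    close(fds[1]);
    return ok ? out : nullptr;
}
```

Calling roundtrip(&x) for some Payload x should return &x: the pipe carries the eight (or four) address bytes unchanged.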

Talk is cheap, here is the demo code for Linux (defensive code and error handlers are omitted for simplicity). Just copy & paste to pipe_ipc_demo.cc then compile & run the test.

#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <pthread.h>
#include <sys/select.h>
#include <string>
#include <list>

template<class T> class MPSCQ { // pipe based Multi Producer Single Consumer Queue
public:
        MPSCQ();
        ~MPSCQ();
        int producerPush(const T* t);
        T* consumerPoll(double timeout = 1.0);
private:
        void _consumeFd();
        int _selectFdConsumer(double timeout);
        T* _popFront();
        int _fdProducer;
        int _fdConsumer;
        char* _consumerBuf;
        std::string* _partial;
        std::list<T*>* _list;
        static const int _PTR_SIZE;
        static const int _CONSUMER_BUF_SIZE;
};

template<class T> const int MPSCQ<T>::_PTR_SIZE = sizeof(void*);
template<class T> const int MPSCQ<T>::_CONSUMER_BUF_SIZE = 1024;

template<class T> MPSCQ<T>::MPSCQ() :
        _fdProducer(-1),
        _fdConsumer(-1) {
        _consumerBuf = new char[_CONSUMER_BUF_SIZE];
        _partial = new std::string;     // for holding partial pointer address
        _list = new std::list<T*>;      // unconsumed T* cache
        int fd_[2];
        int r = pipe(fd_);              // error handler omitted
        _fdConsumer = fd_[0];
        _fdProducer = fd_[1];
}

template<class T> MPSCQ<T>::~MPSCQ() { /* omitted */ }

template<class T> int MPSCQ<T>::producerPush(const T* t) {
        // write the pointer value itself (not the pointed-to object) into the pipe
        return t == NULL ? 0 : write(_fdProducer, &t, _PTR_SIZE);
}

template<class T> T* MPSCQ<T>::consumerPoll(double timeout) {
        T* t = _popFront();
        if (t != NULL) {
                return t;
        }
        if (_selectFdConsumer(timeout) <= 0) {  // timeout or error
                return NULL;
        }
        _consumeFd();                           // refill the local cache from the pipe
        return _popFront();
}

template<class T> void MPSCQ<T>::_consumeFd() {
        memcpy(_consumerBuf, _partial->data(), _partial->length());
        // read after the leftover bytes so they are not overwritten
        ssize_t r = read(_fdConsumer, _consumerBuf + _partial->length(), _CONSUMER_BUF_SIZE - _partial->length());
        if (r <= 0) {   // EOF or error, error handler omitted
                return;
        }
        const char* p = _consumerBuf;
        int remaining_len_ = _partial->length() + r;
        T* t;
        while (remaining_len_ >= _PTR_SIZE) {
                memcpy(&t, p, _PTR_SIZE);
                _list->push_back(t);
                remaining_len_ -= _PTR_SIZE;
                p += _PTR_SIZE;
        }
        *_partial = std::string(p, remaining_len_);
}

template<class T> int MPSCQ<T>::_selectFdConsumer(double timeout) {
        int r;
        int nfds_ = _fdConsumer + 1;
        fd_set readfds_;
        struct timeval timeout_;
        int64_t usec_ = timeout * 1000000.0;
        while (true) {
                timeout_.tv_sec = usec_ / 1000000;
                timeout_.tv_usec = usec_ % 1000000;
                FD_ZERO(&readfds_);             // the set must be cleared before each select()
                FD_SET(_fdConsumer, &readfds_);
                r = select(nfds_, &readfds_, NULL, NULL, &timeout_);
                if (r < 0 && errno == EINTR) {
                        continue;               // interrupted by a signal, retry
                }
                return r;
        }
}

template<class T> T* MPSCQ<T>::_popFront() {
        if (!_list->empty()) {
                T* t = _list->front();
                _list->pop_front();
                return t;
        } else {
                return NULL;
        }
}

// = = = = = test code below = = = = =

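#define _LOOP_CNT    5000000
#define _ONE_MILLION 1000000
#define _PRODUCER_THREAD_NUM 2  // value inferred from the "2 * sum(...)" check in the test result

struct TestMsg {        // all public
        int _threadId;
        int _msgId;
        int64_t _val;
        TestMsg(int thread_id, int msg_id, int64_t val) :
                _threadId(thread_id),
                _msgId(msg_id),
                _val(val) { };
};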

static MPSCQ<TestMsg> _QUEUE;
static int64_t _SUM = 0;

void* functor_producer(void* arg) {
        int my_thr_id_ = pthread_self();
        TestMsg* msg_;
        for (int i = 0; i <= _LOOP_CNT; ++ i) {
                if (i == _LOOP_CNT) {
                        msg_ = new TestMsg(my_thr_id_, i, -1);  // stop marker
                } else {
                        msg_ = new TestMsg(my_thr_id_, i, i + 1);
                }
                _QUEUE.producerPush(msg_);
        }
        return NULL;
}

void* functor_consumer(void* arg) {
        int msg_cnt_ = 0;
        int stop_cnt_ = 0;
        TestMsg* msg_;
        while (true) {
                if ((msg_ = _QUEUE.consumerPoll()) == NULL) {
                        continue;       // timeout, poll again
                }
                int64_t val_ = msg_->_val;
                delete msg_;
                if (val_ <= 0) {        // stop marker from one producer
                        if ((++ stop_cnt_) >= _PRODUCER_THREAD_NUM) {
                                printf("All done, _SUM=%ld\n", _SUM);
                                break;
                        }
                } else {
                        _SUM += val_;
                        if ((++ msg_cnt_) % _ONE_MILLION == 0) {
                                printf("msg_cnt_=%d, _SUM=%ld\n", msg_cnt_, _SUM);
                        }
                }
        }
        return NULL;
}

int main(int argc, char* const* argv) {
        pthread_t consumer_;
        pthread_create(&consumer_, NULL, functor_consumer, NULL);
        pthread_t producers_[_PRODUCER_THREAD_NUM];
        for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
                pthread_create(&producers_[i], NULL, functor_producer, NULL);
        }
        for (int i = 0; i < _PRODUCER_THREAD_NUM; ++ i) {
                pthread_join(producers_[i], NULL);
        }
        pthread_join(consumer_, NULL);
        return 0;
}

And here is the test result (2 * sum(1..5000000) == (1 + 5000000) * 5000000 == 25000005000000):

$ g++ -o pipe_ipc_demo pipe_ipc_demo.cc -lpthread
$ ./pipe_ipc_demo    ## output may vary except for the final _SUM
msg_cnt_=1000000, _SUM=251244261289
msg_cnt_=2000000, _SUM=1000708879236
msg_cnt_=3000000, _SUM=2250159002500
msg_cnt_=4000000, _SUM=4000785160225
msg_cnt_=5000000, _SUM=6251640644676
msg_cnt_=6000000, _SUM=9003167062500
msg_cnt_=7000000, _SUM=12252615629881
msg_cnt_=8000000, _SUM=16002380952516
msg_cnt_=9000000, _SUM=20252025092401
msg_cnt_=10000000, _SUM=25000005000000
All done, _SUM=25000005000000

The technique shown here is used in our production applications. One typical usage is a consumer thread acting as a log writer, so that worker threads can write log messages almost asynchronously. Yes, "almost": sometimes writer threads may block in write() when the pipe is full, and this is a reliable congestion control feature provided by the OS.
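For readers who want to see the pipe-full condition without blocking a thread, here is a small sketch that puts the write end into non-blocking mode and writes pointer-sized items until the kernel refuses more. This is an illustration only and not part of the demo above; fill_pipe is a hypothetical name, and the number of accepted writes depends on the system's pipe capacity, so no particular count is assumed.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>    // fcntl, O_NONBLOCK
#include <unistd.h>   // pipe, write, close

// Fills a fresh pipe with pointer-sized writes in non-blocking mode.
// Returns how many writes were accepted before the pipe reported "full"
// (EAGAIN/EWOULDBLOCK), or -1 on any unexpected error.
long fill_pipe() {
    int fds[2];
    if (pipe(fds) != 0) {
        return -1;
    }
    int flags = fcntl(fds[1], F_GETFL);
    if (flags < 0 || fcntl(fds[1], F_SETFL, flags | O_NONBLOCK) < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    void* dummy = nullptr;          // stands in for a real object address
    long accepted = 0;
    ssize_t r;
    while ((r = write(fds[1], &dummy, sizeof dummy)) == (ssize_t) sizeof dummy) {
        ++accepted;
    }
    // Evaluate errno before close() can change it.
    bool full = (r < 0 && (errno == EAGAIN || errno == EWOULDBLOCK));
    close(fds[0]);
    close(fds[1]);
    return full ? accepted : -1;
}
```

With a blocking descriptor, as in the demo, the same condition simply makes write() wait until the consumer drains the pipe, which is the back-pressure described above.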

