Periodic latency spikes in shared memory writes on Linux

Question

I have the following code:

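#include <fcntl.h>      // O_RDWR, O_CREAT
#include <sys/mman.h>   // shm_open, mmap
#include <sys/stat.h>   // fchmod
#include <unistd.h>     // ftruncate, usleep

#include <cerrno>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <ctime>
#include <iostream>

#pragma pack(4)
struct RECORD_HEADER {
    uint64_t msgType;
    uint64_t rdtsc;
};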
struct BODY {
    char content[488];
};
#pragma pack()

class SerializedRDTSC {
public:
    typedef unsigned long long timeunit_t;

    static timeunit_t start(void) {
            unsigned cycles_high, cycles_low;
            __asm__ __volatile__ (  "CPUID\n\t"
                                    "RDTSC\n\t"
                                    "mov %%edx, %0\n\t"
                                    "mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
                                    "%rax", "%rbx", "%rcx", "%rdx");
            return ( (unsigned long long)cycles_low)|( ((unsigned long long)cycles_high)<<32 );
    }

    static timeunit_t end(void) {
            unsigned cycles_high, cycles_low;
            __asm__ __volatile__(   "RDTSCP\n\t"
                                    "mov %%edx, %0\n\t"
                                    "mov %%eax, %1\n\t"
                                    "CPUID\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax",
                                    "%rbx", "%rcx", "%rdx");
            return ( (unsigned long long)cycles_low)|( ((unsigned long long)cycles_high)<<32 );
    }

};

char* createSHM() noexcept {
        const auto sharedMemHandle = shm_open("testing", O_RDWR | O_CREAT, 0666);
        if (-1 == sharedMemHandle) {
            std::cout << "failed to open named shared memory: " << std::endl;
            return nullptr;
        }
        constexpr int32_t size = (1 << 26);
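        if (-1 == ftruncate(sharedMemHandle, size)) {
            std::cout << "failed to resize shared memory: " << errno << std::endl;
            return nullptr;
        }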
        char* ptr = (char*) mmap(nullptr, size, PROT_READ | PROT_WRITE,
                MAP_SHARED, sharedMemHandle, 0);

        if (MAP_FAILED == ptr) {
            std::cout << errno << std::endl;
            return nullptr;
        }

        const auto rc = fchmod(sharedMemHandle, 0666);
        if (rc == -1) {
            fprintf(stderr,
                    "Can't change permissions to 0666 on shared mem segment: %m\n");
            fflush(stderr);
        }
        return ptr;
}

int main() {
    BODY update;

    srand(time(nullptr));
    char* ptr = createSHM();

    constexpr uint64_t n = 700;
    constexpr uint64_t n2 = 10;
    uint64_t m_data[n * n2];
    memset(m_data, 0, sizeof(m_data));

    uint64_t r = 0;

    for (uint64_t i = 0; i < n; i++) {
        for (uint64_t k = 0; k < n2; k++) {
            // populate the header
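            // use 64-bit values so the 8-byte memcpy below does not read
            // past the end of a 4-byte int
            const uint64_t msgType = rand();
            const uint64_t rdtsc = rand();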

            // populate the struct randomly
            uint32_t* tmp = reinterpret_cast<uint32_t*>(&update);
            for (uint32_t j = 0; j < sizeof(BODY) / sizeof(uint32_t); j++) {
                const uint32_t v = rand() % 32767;
                tmp[j] = v;
            }

            // write the struct
            const auto s = SerializedRDTSC::start();
            memcpy(ptr, (char*)&msgType, sizeof(uint64_t));
            ptr+= sizeof(uint64_t);
            memcpy(ptr, (char*)&rdtsc, sizeof(uint64_t));
            ptr+= sizeof(uint64_t);
            memcpy(ptr, &update, sizeof(BODY));
            ptr+= sizeof(BODY);
            const auto e = SerializedRDTSC::end();
            m_data[r++] = e - s;
        }
        usleep(249998);
    }

    for (uint32_t i = 0; i < r; i++) {
        std::cout << i << "," << m_data[i] << std::endl;
    }
}

For some reason, the output shows periodic latency spikes.

I have already isolated the core and double-checked with htop that no other process was using it.

My machine has an i7 CPU (nothing fancy).

I then tried a Xeon CPU. The pattern is about the same: every 7-11 writes, there is a spike.

On the i7, I compiled with GCC 7.2 (C++17) and ran on CentOS 7.3.

On the Xeon, I compiled with GCC 4.6 (C++0x) and ran on CentOS 6.5.

My questions are:
1. Why are there periodic latency spikes? (I checked with strace and did not see any unusual system calls involved.)
2. Any suggestions on how to investigate/understand the spikes? This is mostly for my own learning.

Thanks in advance!

P.S. Yes, some people object to using rdtsc to measure latency because temperature affects the TSC. Still, I don't see a better option, as I don't have PTP, and clock_gettime() sometimes has latency spikes too. Any suggestions are more than welcome :)

Answer 1

A memory page is 4K bytes. Every time you start writing to a new page, that page needs to be mapped into the process address space. Since the data you're writing on every iteration is 8 + 8 + 488 = 504 bytes, you'll get a spike every 8 or 9 times through the loop.

Since the CPU can speculatively prefetch data from memory, the page fault for the 2nd page (which should occur on the 8th loop) occurs one loop earlier than expected, when the hardware prefetcher tries to access the page.

Solution 1
3 · Accepted · 2018-03-14 02:06:08
