Practical limits on prefetching a lot of reference data

I am considering doing the following:

  • Write a daemon process that creates about 50KB of reference data and stores it in an array in shared memory.
  • The daemon process will have its affinity set to one core on a socket, and will periodically call __builtin_prefetch() on the address of every 64th byte of the reference data, to keep all the reference data in the L3 cache for processes running on the other cores of the same socket (see the sketch after this list).
  • Multiple times per second, application processes will index into the array to retrieve whatever reference datum they need at that time. Since the data will be in the L3 cache, access time will be relatively fast.
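
For concreteness, here is a minimal sketch of the daemon side, assuming Linux with POSIX shared memory. The region name "/refdata", the choice of core 0, and the one-millisecond re-warm interval are hypothetical placeholders, and error handling is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

#define REF_SIZE (50 * 1024) /* ~50KB of reference data */

int main(void) {
    /* Pin the daemon to one core so its prefetch traffic runs on a
       single core while filling the L3 shared by the rest of the socket. */
    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(0, &cpus); /* hypothetical core choice */
    sched_setaffinity(0, sizeof(cpus), &cpus);

    /* Create and map the shared-memory region holding the reference data. */
    int fd = shm_open("/refdata", O_CREAT | O_RDWR, 0644);
    ftruncate(fd, REF_SIZE);
    char *ref = mmap(NULL, REF_SIZE, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);

    /* ... compute and store the reference data in ref[] ... */

    for (;;) {
        /* One prefetch per 64-byte cache line. */
        for (size_t i = 0; i < REF_SIZE; i += 64)
            __builtin_prefetch(ref + i);
        usleep(1000); /* re-warm roughly every millisecond */
    }
}

(On older glibc, shm_open() requires linking with -lrt.)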

I assume I'm not the first person to come up with such an idea. Can anybody offer advice on limitations I might run into? For example, consider the following pseudocode in the daemon process for keeping the reference data in the cache:

for (size_t i = 0; i < sizeof(referenceData); i += 64) {
    __builtin_prefetch((const char *)referenceData + i); // one prefetch per 64-byte cache line
}

For 50KB of reference data, the above loop would call __builtin_prefetch() 800 times in rapid succession. Would doing that cause problems, such as latency spikes when other applications try to access memory (other than the reference data)? If so, I could insert a sleep statement into the for-loop:

for (size_t i = 0; i < sizeof(referenceData); i += 64) {
    __builtin_prefetch((const char *)referenceData + i);
    if ((i / 64) % 10 == 9) { // sleep every 10th time around the loop
        sleep_briefly();
    }
}
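
Here sleep_briefly() is left undefined; a hypothetical implementation could simply wrap nanosleep() (the actual delay is subject to timer and scheduler granularity, so it will usually be longer than requested):

#include <time.h>

// Hypothetical helper: pause between prefetch bursts so they are
// spread out rather than issued back-to-back.
static void sleep_briefly(void) {
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 1000 }; /* ~1µs, rounded up by the kernel */
    nanosleep(&ts, NULL);
}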

Advice and links to relevant documentation are appreciated.

Edit to add additional information based on comments:

  • The reference data will be unchanging. Other processes will access a tiny subset of the data on each application-level event: probably about 7 lookups into the array, each retrieving 4 bytes, so about 28 bytes per event.

  • I don't think it will be possible to predict which data entries are most likely to be accessed, so I would like to keep the entire reference data in cache rather than just a small subset of it.

  • If latency did not matter, then there would not be a need for the cached reference data, since each application could re-compute whatever it needed on an as-needed basis for each event. However, latency of responding to events does matter.

  • I haven't developed all the application code yet, but I am expecting "respond to an event" time of less than 200ns without this optimisation. If this optimisation works out well, then it might reduce the "respond to an event" time to less than 100ns.

  • The events might occur perhaps as often as a few hundred times per second or as infrequently as once every few seconds. So my concern is that if the reference data is not actively kept warm in the cache, occasionally it will be flushed out of the cache due to lack of use.

A better and simpler solution is for the users of the reference data to load/cache that data in advance as they see fit.
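
For example, each consumer could map the region read-only and keep it warm from its own housekeeping path. A minimal sketch, reusing the hypothetical "/refdata" region from above, with error handling omitted:

#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

#define REF_SIZE (50 * 1024)

/* Map the shared reference data read-only. */
static const uint32_t *ref_open(void) {
    int fd = shm_open("/refdata", O_RDONLY, 0);
    return mmap(NULL, REF_SIZE, PROT_READ, MAP_SHARED, fd, 0);
}

/* Touch one address per 64-byte cache line (16 four-byte entries). */
static void ref_warm(const uint32_t *ref) {
    for (size_t i = 0; i < REF_SIZE / sizeof(*ref); i += 16)
        __builtin_prefetch(ref + i);
}

This keeps the cache-warming policy in the process that actually pays the latency cost, rather than in a daemon that can only guess at it.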

Having a dedicated process stomp on the CPU cache like this doesn't seem reasonable at all.

On Intel you can use Cache Allocation Technology to reserve a certain amount of L3 cache for your applications (on Linux it is exposed through the resctrl filesystem).
