
Stack vs cache friendly allocator

A few days ago I started experimenting with cache-friendly code and wrote a few different constructs to determine how performance changes when I place variables on the stack versus the heap, and how different memory layouts scale with linear tasks like iterating and searching.

I am not measuring allocation time, just processing performance.

The tests are not very accurate, but they should at least give some comparable numbers for how the performance might differ.

First I compared the performance of a std::array with that of a std::vector.

The test code for the array:

int main()
{
    std::array<mango::int16, 5000000> v;

    mango::delta_timer timer; //simple timer class

    for (int i = 0; 5000000 > i; ++i)
    {
        v[i] = i; //I know that i will overflow but that's no problem in this case
    }

    timer.start();
    mango::for_each(v.begin(), v.end(), [](mango::int16& i)->void {++i; });
    timer.stop();

    std::cout << (double)timer.totalTime();

    mango::mgetch(); /*crossplatform wrapper for _getch() --> supposed to
    give me a point where I can exit the program without printing the results*/

    mango::for_each(v.begin(), v.end(), print); /*print the entire
    vector and hope that this will prevent the compiler from optimizing the array away*/

    return 0;
}

The code for a regular vector:

int main()
{
    std::vector<mango::int16> v;
    v.reserve(5000000);

    mango::delta_timer timer;

    for (int i = 0; 5000000 > i; ++i)
    {
        v.push_back(i);
    }

    timer.start();
    mango::for_each(v.begin(), v.end(), [](mango::int16& i)->void {++i; });
    timer.stop();

    std::cout << (double)timer.totalTime();

    mango::mgetch();

    mango::for_each(v.begin(), v.end(), print);

    return 0;
}

The for_each on the array took between 0.003 and 0.004 seconds, while the for_each on the vector took between 0.005 and 0.007 seconds.

After the first tests I rolled a very slim, minimalistic allocator to see whether I could get performance similar to stack memory.

The allocator looks like this:

class block_allocator
{
public:
    block_allocator(mango::int32 n, mango::int32 bsize, mango::int32 id)
        : m_Memory(new mango::byte[n * bsize]), m_Capacity(n), m_BlockSize(bsize), m_ID(id)
    {
        m_Blocks.reserve(n); /*reserve, don't size-construct: m_Blocks(n)
        would create n null entries before the loop appends the real ones*/
        for (mango::byte* iterator = (mango::byte*)m_Memory; ((mango::byte*)m_Memory + n * bsize) > iterator; iterator += bsize)
        {
            m_Blocks.push_back(iterator);
        }
    }

    ~block_allocator()
    {
        delete[](mango::byte*)m_Memory;
        m_Memory = nullptr;
    }

    void* allocate(mango::uint32 n)
    {
        if (m_Blocks.empty())
        {
            throw mango::exception::out_of_range(mango::to_string(m_ID) + std::string(" allocator went out of range"), "out_of_range");
        }

        void* block = m_Blocks.back();
        m_Blocks.pop_back();

        return block;
    }

    void deallocate(void* target)
    {
        if (m_Blocks.size() == static_cast<std::size_t>(m_Capacity))
        {
            return; /*every block is already free, so target cannot come from
            this pool; delete[] on a pointer into the middle of m_Memory
            would be undefined behavior, so just refuse it*/
        }

        m_Blocks.push_back(target);
    }

private:
    void*                m_Memory;

    mango::int32         m_Capacity;
    mango::int32         m_BlockSize;
    mango::int32         m_ID;

    std::vector<void*>   m_Blocks;
};

It's just a minimalistic sample for testing, and it's not suited for production use!

This is my test sample with the allocator:

int main()
{
    std::array<mango::int16*, 5000000> v;

    mango::delta_timer timer;

    for (int i = 0; 5000000 > i; ++i)
    {
        v[i] = allocate_int(i); //allocates an int with the allocator
    }

    timer.start();
    mango::for_each(v.begin(), v.end(), [](mango::int16* i)->void {++(*i); });
    timer.stop();

    std::cout << (double)timer.totalTime();

    mango::mgetch();

    mango::for_each(v.begin(), v.end(), print);

    return 0;
}

With this example the for_each performance fell between 0.003 and 0.004 seconds, just like the first array example.
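For reference, the allocate_int helper isn't shown above. A self-contained sketch of what it might look like (the mango:: types are swapped for std equivalents here, and the global pool and the helper's exact signature are my assumptions):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Trimmed-down pool in the spirit of the block_allocator above:
// one contiguous buffer, carved into fixed-size blocks up front.
class pool
{
public:
    pool(std::size_t n, std::size_t bsize)
        : m_Memory(new std::uint8_t[n * bsize])
    {
        m_Blocks.reserve(n);
        for (std::size_t i = 0; i < n; ++i)
            m_Blocks.push_back(m_Memory + i * bsize); // free list of blocks
    }

    ~pool() { delete[] m_Memory; }

    void* allocate()
    {
        assert(!m_Blocks.empty());
        void* block = m_Blocks.back();
        m_Blocks.pop_back();
        return block;
    }

private:
    std::uint8_t*      m_Memory;
    std::vector<void*> m_Blocks;
};

static pool g_pool(1000, sizeof(std::int16_t)); // assumed global instance

// Hypothetical helper: grab a block and construct the value in place.
std::int16_t* allocate_int(std::int16_t value)
{
    return new (g_pool.allocate()) std::int16_t(value); // placement new
}
```

Because every block comes out of one buffer, the pointers stored in the array all point into a single contiguous region, which is the whole point of the experiment.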

There is no cleanup in any of these examples, I know.

So here is the question: since I had to increase the stack size in Visual Studio 2015 to get this sample running (otherwise a stack overflow would occur), and given that the stack gets slower as it grows, what would be the preferable way to write cache-friendly code?

Using a cache-friendly allocator that keeps objects close together on the heap achieves performance equal to using the stack (this might differ in other examples, but I think even close-to-stack performance will be enough for most programs).

Wouldn't it be more effective to build a proper allocator, store the large stuff on the heap, and keep the count of "real" allocations low, instead of overusing the stack? I ask because I read "use the stack as frequently as you can" all over the internet, and I am concerned that this approach isn't as simple as many people think.

Thank you.

Don't overestimate the value to the cache of keeping everything on the stack. Yes, it's nice for newly allocated objects to fit into lines that are already cached. But on, e.g., Haswell, cache lines are only 64 bytes, so you quickly run out of contiguity as far as the cache is concerned. (There is some benefit to cache set distribution, but it's a minor one.) And if you're writing the sort of code where you can actually benefit from extra cache locality, then you're generally working with large-ish arrays, which are contiguous no matter where they live.
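As a back-of-the-envelope check of that figure (the 64-byte line size is the Haswell number quoted above):

```cpp
#include <cstddef>
#include <cstdint>

// Only 32 int16 values share one 64-byte cache line, so contiguity with
// the top of the stack stops mattering after a few dozen scalars.
constexpr std::size_t cache_line_bytes = 64; // assumed, Haswell-like
constexpr std::size_t int16_per_line   = cache_line_bytes / sizeof(std::int16_t);
static_assert(int16_per_line == 32, "32 int16 values per 64-byte line");
```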

The "use the stack, not the heap" advice is, I think, advising you to avoid indirection.
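A minimal illustration of that indirection (my own example, not from the question): the flat sum walks a contiguous buffer linearly, while the indirect sum chases a pointer per element, and those targets may be scattered anywhere on the heap.

```cpp
#include <cstdint>
#include <numeric>
#include <vector>

// Contiguous payload: the prefetcher sees a pure linear scan.
long sum_flat(const std::vector<std::int16_t>& v)
{
    return std::accumulate(v.begin(), v.end(), 0L);
}

// One extra memory hop on every iteration; the pointed-to values
// can live anywhere, so locality depends entirely on the allocator.
long sum_indirect(const std::vector<std::int16_t*>& v)
{
    long total = 0;
    for (std::int16_t* p : v)
        total += *p;
    return total;
}
```

Both functions compute the same result; the difference is purely in the memory-access pattern.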

With all that said, there is some minor benefit to a separate allocator that assumes, and benefits from, LIFO allocation patterns. But it comes from the reduced bookkeeping cost, not from cache friendliness.
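Such a LIFO-friendly allocator can be sketched as a bump-and-rewind arena (my assumption of what's meant; the names are made up):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// The entire bookkeeping is one offset: allocate bumps it, deallocate
// rewinds it. Cache behavior is that of any contiguous buffer; the win
// is the near-zero per-allocation cost.
class stack_allocator
{
public:
    explicit stack_allocator(std::size_t bytes)
        : m_Buffer(new std::uint8_t[bytes]), m_Size(bytes), m_Top(0) {}

    ~stack_allocator() { delete[] m_Buffer; }

    void* allocate(std::size_t n)
    {
        if (m_Top + n > m_Size)
            return nullptr;           // out of space
        void* p = m_Buffer + m_Top;
        m_Top += n;                   // bump: the whole bookkeeping
        return p;
    }

    // Only valid for the most recently allocated live block (LIFO).
    void deallocate(void* p, std::size_t n)
    {
        assert(static_cast<std::uint8_t*>(p) + n == m_Buffer + m_Top);
        m_Top -= n;                   // rewind
    }

    std::size_t used() const { return m_Top; }

private:
    std::uint8_t* m_Buffer;
    std::size_t   m_Size;
    std::size_t   m_Top;
};
```

Note the precondition: deallocations must come in reverse order of allocations, which is exactly the pattern stack variables follow anyway.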
