
Why is this code extremely slow? Is it related to cache behavior?

I started some data-oriented design experiments. I initially wrote some OOP code and found that part of it is extremely slow, and I don't know why. Here is one example: I have a game object

    class GameObject
    {
    public:
          float m_Pos[2];
          float m_Vel[2];
          float m_Foo;

          void UpdateFoo(float f){
             float mag = sqrtf(m_Vel[0] * m_Vel[0] + m_Vel[1] * m_Vel[1]);
             m_Foo += mag * f;
          }
     };

then I create 1,000,000 objects using new, and loop over them calling UpdateFoo():

        for (unsigned i=0; i<OBJECT_NUM; ++i)
        {
           v_objects[i]->UpdateFoo(10.0);
        }

It takes about 20 ms to finish the loop. Strangely, when I comment out float m_Pos[2], so the object looks like this

    class GameObject
    {
    public:
          //float m_Pos[2];
          float m_Vel[2];
          float m_Foo;

          void UpdateFoo(float f){
             float mag = sqrtf(m_Vel[0] * m_Vel[0] + m_Vel[1] * m_Vel[1]);
             m_Foo += mag * f;
          }
     };

the loop suddenly takes about 150 ms to finish. And if I put any member before m_Vel, it is much faster again. I tried putting padding between m_Vel and m_Foo and in other places, but only padding before m_Vel helps; everything else is slow.

I tested with VS2008 and VS2010 release builds on an i7-4790. Any idea how this difference could happen? Is it related to cache behavior?

Here is the whole sample:

    #include <iostream>
    #include <math.h>
    #include <stdio.h>   // for printf
    #include <vector>
    #include <Windows.h>

    using namespace std;

    class GameObject
    {
    public:
        //float m_Pos[2];
        float m_Velocity[2];
        float m_Foo;

        void UpdateFoo(float f)
        {
          float mag = sqrtf(m_Velocity[0] * m_Velocity[0] + m_Velocity[1] * 
                            m_Velocity[1]);
          m_Foo += mag * f;
         }
    };



     #define OBJECT_NUM 1000000

     int main(int argc, char **argv)
     {
       vector<GameObject*> v_objects;
       for (unsigned i=0; i<OBJECT_NUM; ++i)
       {
          GameObject * pObject = new GameObject;
          v_objects.push_back(pObject);
       }

       LARGE_INTEGER nFreq;
       LARGE_INTEGER nBeginTime;
       LARGE_INTEGER nEndTime;
       QueryPerformanceFrequency(&nFreq);
       QueryPerformanceCounter(&nBeginTime);

       for (unsigned i=0; i<OBJECT_NUM; ++i)
       {
           v_objects[i]->UpdateFoo(10.0);
       }

       QueryPerformanceCounter(&nEndTime);
       double dWasteTime = (double)(nEndTime.QuadPart-
                       nBeginTime.QuadPart)/(double)nFreq.QuadPart*1000;

       printf("finished: %f", dWasteTime);

       //   for (unsigned i=0; i<OBJECT_NUM; ++i)
       //   {
       //       delete(v_objects[i]);
       //   }
     }

then I create 1,000,000 objects using new, and loop over them calling UpdateFoo()

There's your problem right there. Don't use a general-purpose allocator to individually allocate a million teeny objects that are going to be processed repeatedly.

Try storing the objects contiguously, or in contiguous chunks. An easy solution is to store them all in one big std::vector. To remove an element in constant time, swap it with the last element and pop the back. If you need stable indices, you can instead leave a hole behind to be reclaimed on the next insertion (using a free list or a stack of free indices). If you need stable pointers that never invalidate, a deque combined with the same "holes" idea, again using a free list or a separate stack of indices to reclaim and overwrite, might be an option.

You can also use a free-list allocator and placement new against it, taking care to free through the same allocator and to invoke the destructor manually, but that gets messy fast and takes more practice to do well than the data-structure approach. I recommend instead simply storing your game objects in one big container, so that you regain control over where everything resides in memory and get the spatial locality that results.

I tested with VS2008 and VS2010 release builds on an i7-4790. Any idea how this difference could happen? Is it related to cache behavior?

If you are benchmarking and building the project properly, the allocator may be fragmenting memory more when GameObject is smaller, causing more cache misses as a result. That seems the most likely explanation, but it is difficult to know for sure without a good profiler.

That said, instead of analyzing it further, I recommend the solution above, so that you don't have to worry about where the allocator places every teeny object in memory.
