Intuition about memory layout for fast SIMD / data-oriented design

Question

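I've been watching data-oriented design talks lately, but I have never understood the reasoning behind the memory layout they all unanimously choose.

Let's say we have a 3D animation to render, and in each frame we need to re-normalize our orientation vectors.

“Scalar code”

They always show code that might look something like this: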

let scene = [{"camera1", vec4{1, 1, 1, 1}}, ...]

for object in scene
    object.orientation = normalize(object.orientation)

So far so good... The memory at &scene might look roughly like this:

[string,X,Y,Z,W,string,X,Y,Z,W,string,X,Y,Z,W,...]
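For concreteness, the scalar loop could look roughly like this in C++. This is only a sketch: Vec4, Object and normalize_scene are hypothetical names, since the pseudocode above does not define them.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <vector>

// Hypothetical types standing in for {"camera1", vec4{1, 1, 1, 1}}.
struct Vec4 {
    float x, y, z, w;
};

struct Object {
    std::string name;  // cold data: never touched by the loop below
    Vec4 orientation;  // hot data
};

// Scales one vector to unit Euclidean length (the input must be non-zero).
inline Vec4 normalize(Vec4 v) {
    const float norm = std::sqrt(v.x * v.x + v.y * v.y + v.z * v.z + v.w * v.w);
    assert(norm > 0.0f);
    return {v.x / norm, v.y / norm, v.z / norm, v.w / norm};
}

// One object per iteration; every step also strides over a name it never uses.
inline void normalize_scene(std::vector<Object> &scene) {
    for (Object &object : scene) {
        object.orientation = normalize(object.orientation);
    }
}
```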

“SSE-aware code”

Every talk then shows the improved, cookie-cutter version:

let xs = [1, ...]
let ys = [1, ...]
let zs = [1, ...]
let ws = [1, ...]
let scene = [{"camera1", ptr_vec4{&xs[1], &ys[1], &zs[1], &ws[1]}}, ...]

for (o1, o2, o3, o4) in scene
    (o1, o2, o3, o4) = normalize_sse(o1, o2, o3, o4)

Thanks to its memory layout, this version is not only more memory-efficient, it can also process our scene 4 objects at a time.
Memory at &xs, &ys, &zs and &ws:

[X,X,X,X,X,X,...]
[Y,Y,Y,Y,Y,Y,...]
[Z,Z,Z,Z,Z,Z,...]
[W,W,W,W,W,W,...]

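But why 4 separate arrays?

If __m128 (4 packed singles) is the predominant type in engines,
which I believe it is;
if that type is 128 bits long,
which it definitely is;
if cache line width / 128 = 4,
which it almost always is;
if x86_64 can only write a full cache line at a time,
which I am almost certain of
- then why isn't the data laid out as follows instead?!

Memory at &packed_orientations: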

[X,X,X,X,Y,Y,Y,Y,Z,Z,Z,Z,W,W,W,W,X,X,...]
 ^---------cache-line------------^
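Written out as a type, the proposed layout might look like the sketch below. PackedBlock, y_of and next_block are hypothetical names, and the 64-byte cache line is an assumption that holds on typical x86-64 parts.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical block type: the x, y, z and w values of four objects.
// 16 floats = 64 bytes, i.e. one cache line under the assumption above.
struct alignas(64) PackedBlock {
    float x[4];
    float y[4];
    float z[4];
    float w[4];
};
static_assert(sizeof(PackedBlock) == 64, "one block per 64-byte line");

// y component of object i: block i / 4, lane i % 4.
inline float &y_of(PackedBlock *scene, std::size_t i) {
    assert(scene != nullptr);
    return scene[i / 4].y[i % 4];
}

// Advancing to the next four objects is a single pointer increment.
inline PackedBlock *next_block(PackedBlock *block) { return block + 1; }
```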

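~~I have no benchmark to test this, and I don't understand intrinsics well enough to even try~~, but by my intuition, shouldn't this way be faster? We would save 4x the page loads and writes, simplify allocations, save pointers, and the code would be simpler, since we could do a single pointer addition instead of four. Am I wrong?

Thank you! :)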

Answer 1

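The amount of data you need to move through your memory subsystem is identical whether you use 4 separate arrays or the suggested interleaving. You therefore don't save any page loads or writes (I don't see why the "separate arrays" case should read and write each page or cache line more than once).

You do spread the memory transfers out differently: you might have 1 L1 cache miss on every iteration in your case, versus 4 cache misses on every 4th iteration in the "separate arrays" case. I don't know which one would be preferable.

Anyway, the main point is to avoid pushing memory through your caches that you never interact with. In your example, the string values are neither read nor written, yet they are still pulled through the caches and needlessly take up bandwidth.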
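A back-of-the-envelope version of that last point, as a sketch: the 32-byte name field and the 64-byte cache line are assumptions chosen for illustration (sizeof(std::string) is implementation-defined, so a fixed char array stands in for it).

```cpp
#include <cstddef>

// Hypothetical AoS element with a fixed-size name in place of the string.
struct Object {
    char name[32];         // never read or written by the normalize loop
    float orientation[4];  // the 16 bytes the loop actually needs
};
static_assert(sizeof(Object) == 48, "32-byte name + 4 floats");

constexpr std::size_t kLine = 64;  // assumed cache-line size in bytes

// Number of cache lines needed to cover `bytes` bytes (line-aligned start).
constexpr std::size_t lines(std::size_t bytes) { return (bytes + kLine - 1) / kLine; }

// Bytes streamed through the cache to normalize n objects in each layout.
constexpr std::size_t aos_bytes(std::size_t n) { return lines(n * sizeof(Object)) * kLine; }
constexpr std::size_t soa_bytes(std::size_t n) { return 4 * lines(n * sizeof(float)) * kLine; }

// With 40'000 objects the AoS loop moves three times as much data.
static_assert(aos_bytes(40000) == 3 * soa_bytes(40000), "AoS moves 3x the data");
```

The interleaved layout from the question moves the same number of bytes as the four separate arrays here, which is why neither of them saves traffic over the other.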

Answer 2

One major downside of interleaving by the vector width is that the layout has to change to take advantage of wider vectors (AVX, AVX-512).
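To illustrate the dependence on vector width, here is a small sketch of where an element ends up for different lane counts. interleaved_index is a hypothetical helper, not something from the question.

```cpp
#include <cstddef>

// Float offset of component c (0 = x .. 3 = w) of object i when the data is
// interleaved in blocks of Lanes objects: Lanes = 4 for SSE, 8 for AVX,
// 16 for AVX-512.
template <std::size_t Lanes>
constexpr std::size_t interleaved_index(std::size_t i, std::size_t c) {
    return (i / Lanes) * (4 * Lanes)  // skip whole blocks
         + c * Lanes                  // skip to the component's row
         + (i % Lanes);               // lane of this object within the row
}

// The y component of object 5 sits at a different offset for each width,
// so data packed for one width has to be repacked for another.
static_assert(interleaved_index<4>(5, 1) == 21, "SSE-width layout");
static_assert(interleaved_index<8>(5, 1) == 13, "AVX-width layout");
```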

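But yes, when you are purely manually vectorizing (with no loops that a compiler might auto-vectorize with its own choice of vector width), this could be worth it if all of your (important) loops always use all of the struct members.

Otherwise Max's point (see Answer 1) applies: a loop that only touches x and y will waste bandwidth on the z and w members.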
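As rough arithmetic for such an x/y-only loop (a sketch assuming 64-byte cache lines and line-aligned arrays; the function names are made up):

```cpp
#include <cstddef>

constexpr std::size_t kLine = 64;  // assumed cache-line size in bytes

// Cache lines read by a loop over n objects that uses only `used` of the
// four float components, with one separate array per component.
constexpr std::size_t soa_lines(std::size_t n, std::size_t used) {
    return used * ((n * sizeof(float) + kLine - 1) / kLine);
}

// With interleaving, every line carries all four components, so the count
// does not depend on how many of them the loop uses.
constexpr std::size_t interleaved_lines(std::size_t n) {
    return (n * 4 * sizeof(float) + kLine - 1) / kLine;
}

static_assert(soa_lines(40000, 2) == 5000, "x and y arrays only");
static_assert(interleaved_lines(40000) == 10000, "twice the traffic");
```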

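It is not going to be that much faster, though: with a reasonable amount of loop unrolling, indexing 4 arrays or incrementing 4 pointers is barely worse than 1. HW prefetch on Intel CPUs can track one forward + one backward stream per 4k page, so 4 input streams are basically fine.

(But L2 is 4-way associative in Skylake, down from 8 before, so more than 4 input streams that all have the same alignment relative to a 4k page would cause conflict misses / defeat prefetching. So with more than 4 large, page-aligned arrays, an interleaved format could avoid that problem.)

For small arrays, where the whole interleaved thing fits in one 4k page, yes, that is a potential advantage. Otherwise the total number of pages touched and of potential TLB misses is about the same, with a single miss coming 4x as often instead of misses arriving in groups of 4. That might well be better for TLB prefetch, if it can do one page walk ahead of time instead of being swamped by multiple TLB misses arriving at once.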

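Tweaking the SoA struct:

It might help to let the compiler know that the memory pointed to by each of the pointers does not overlap. Most C++ compilers (including all 4 major x86 compilers: gcc, clang, MSVC and ICC) support __restrict as a keyword with the same semantics as C99 restrict. Or, for portability, use #ifdef / #define to define a restrict keyword as empty, or as __restrict, or as whatever is appropriate for the compiler.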

struct SoA_scene {
        size_t size;
        float *__restrict xs;
        float *__restrict ys;
        float *__restrict zs;
        float *__restrict ws;
};

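This can definitely help auto-vectorization; otherwise the compiler does not know that xs[i] = foo; doesn't change the value of ys[i+1] for the next iteration.

If you read those variables into locals (so the compiler is sure that stores through the pointers don't modify the pointers themselves inside the struct), you might declare them as float *__restrict xs = soa.xs; and so on.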
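Putting the portability macro and the local copies together might look like this. It is a sketch: the macro name RESTRICT and the scale function are made up for illustration, and the compiler checks are only one plausible way to write the #ifdef.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical portability macro: expands to __restrict where it is known
// to be supported, and to nothing elsewhere.
#if defined(__GNUC__) || defined(__clang__) || defined(_MSC_VER)
#  define RESTRICT __restrict
#else
#  define RESTRICT
#endif

struct SoA_scene {
    std::size_t size;
    float *RESTRICT xs;
    float *RESTRICT ys;
    float *RESTRICT zs;
    float *RESTRICT ws;
};

// Multiplies every component by `factor`.
inline void scale(SoA_scene &soa, float factor) {
    // Local copies: stores through these cannot change soa.xs and friends,
    // so the compiler does not have to reload the pointers each iteration.
    float *RESTRICT xs = soa.xs;
    float *RESTRICT ys = soa.ys;
    float *RESTRICT zs = soa.zs;
    float *RESTRICT ws = soa.ws;
    assert(xs && ys && zs && ws);
    const std::size_t n = soa.size;
    for (std::size_t i = 0; i < n; ++i) {
        xs[i] *= factor;
        ys[i] *= factor;
        zs[i] *= factor;
        ws[i] *= factor;
    }
}
```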

The interleaved format inherently avoids this aliasing possibility.

Answer 3

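One thing that has not been mentioned yet is that memory access has quite a bit of latency. And of course, when reading from 4 pointers, the result is only available once the last value arrives. So even if 3 out of 4 values are in cache, the last value may need to come from memory and stall the whole operation.

That is why SSE doesn't even support this mode. All your values have to be contiguous in memory, and for quite some time they also had to be aligned (so they could not cross a cache-line boundary).

Importantly, that means your example (Structure of Arrays) does not work in SSE hardware. You can't use element [1] from 4 different vectors in a single operation. You can use elements [0] to [3] from a single vector.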
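To make the two access patterns concrete, here is a small sketch with made-up helper names. The first function is the contiguous load that a loop over separate xs/ys/zs/ws arrays performs, four neighbouring elements of one array at a time. The second builds one register from the same index of four different arrays, which SSE has no single load instruction for.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Four consecutive floats from ONE array, e.g. the x components of
// objects i .. i+3: a single contiguous (unaligned) load.
inline __m128 load_four_xs(const float *xs, std::size_t i) {
    assert(xs != nullptr);
    return _mm_loadu_ps(&xs[i]);
}

// Element i of four DIFFERENT arrays packed into one register: this takes
// four scalar loads that are then combined.
inline __m128 gather_one_object(const float *xs, const float *ys,
                                const float *zs, const float *ws,
                                std::size_t i) {
    // _mm_set_ps lists lanes from highest to lowest, so lane 0 is x.
    return _mm_set_ps(ws[i], zs[i], ys[i], xs[i]);
}
```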

Answer 4

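I have implemented a simple benchmark for both approaches.

Result: the striped layout is at best 10% faster than the standard layout*. But with SSE4.1 we can do better.

*When compiled with gcc -Ofast on an i5-7200U CPU.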
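For reference, the arithmetic behind that figure, using the two timings reported for the striped and SoA runs in this answer (a sketch; the variable names are made up):

```cpp
// Timings reported in this answer, in milliseconds.
constexpr double striped_ms = 4624.0;
constexpr double soa_ms = 4982.0;

// Fraction of the SoA run time that the striped run saved (about 0.072).
constexpr double time_saved = (soa_ms - striped_ms) / soa_ms;

// The same comparison expressed as a speedup factor (about 1.077).
constexpr double speedup = soa_ms / striped_ms;

static_assert(time_saved > 0.07 && time_saved < 0.08, "roughly 7% less time");
```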

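The striped structure is slightly easier to work with, but less flexible. It might, however, have a bit of an advantage in a real scenario, once the allocator is sufficiently busy.

Striped layout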

Time 4624 ms

Memory usage summary: heap total: 713728, heap peak: 713728, stack peak: 2896
         total calls   total memory   failed calls
 malloc|          3         713728              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          1         640000

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
#include <xmmintrin.h>

/* -----------------------------------------------------------------------------
        Striped layout [X,X,X,X,y,y,y,y,Z,Z,Z,Z,w,w,w,w,X,X,X,X...]
----------------------------------------------------------------------------- */

using AoSoA_scene = std::vector<__m128>;

void print_scene(AoSoA_scene const &scene)
{
        // This is likely undefined behavior. Data might need to be stored
        // differently, but this is simpler to index.
        auto &&punned_data = reinterpret_cast<float const *>(scene.data());
        auto scene_size = std::size(scene);

        // Limit to 8 lines
        for(size_t j = 0lu; j < std::min(scene_size, 8lu); ++j) {
                // Vector j is lane j % 4 of the 4-register block j / 4.
                size_t base = (j / 4lu) * 16lu + (j % 4lu);
                for(size_t i = 0lu; i < 4lu; ++i) {
                        printf("%10.3e ", punned_data[base + 4lu * i]);
                }
                printf("\n");
        }
        if(scene_size > 8lu) {
                printf("(%zu more)...\n", scene_size - 8lu);
        }
        printf("\n");
}

void normalize(AoSoA_scene &scene)
{
        // Euclidean norm, SIMD 4 x 4D-vectors at a time.
        for(size_t i = 0lu; i < scene.size(); i += 4lu) {
                __m128 xs = scene[i + 0lu];
                __m128 ys = scene[i + 1lu];
                __m128 zs = scene[i + 2lu];
                __m128 ws = scene[i + 3lu];

                __m128 xxs = _mm_mul_ps(xs, xs);
                __m128 yys = _mm_mul_ps(ys, ys);
                __m128 zzs = _mm_mul_ps(zs, zs);
                __m128 wws = _mm_mul_ps(ws, ws);

                __m128 xx_yys = _mm_add_ps(xxs, yys);
                __m128 zz_wws = _mm_add_ps(zzs, wws);

                __m128 xx_yy_zz_wws = _mm_add_ps(xx_yys, zz_wws);

                __m128 norms = _mm_sqrt_ps(xx_yy_zz_wws);

                scene[i + 0lu] = _mm_div_ps(xs, norms);
                scene[i + 1lu] = _mm_div_ps(ys, norms);
                scene[i + 2lu] = _mm_div_ps(zs, norms);
                scene[i + 3lu] = _mm_div_ps(ws, norms);
        }
}

float randf()
{
        // Seed once and reuse the engine; constructing a std::random_device
        // on every call is slow.
        static std::default_random_engine random_engine{std::random_device{}()};
        static std::uniform_real_distribution<float> distribution(-10.0f, 10.0f);
        return distribution(random_engine);
}

int main()
{
        // Scene description, e.g. cameras, or particles, or boids etc.
        // Has to be a multiple of 4!   -- No edge case handling.
        std::vector<__m128> scene(40'000);

        for(size_t i = 0lu; i < std::size(scene); ++i) {
                scene[i] = _mm_set_ps(randf(), randf(), randf(), randf());
        }

        // Print, normalize 100'000 times, print again

        // Compiler is hopefully not smart enough to realize
        // idempotence of normalization
        using std::chrono::steady_clock;
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        // >:(

        print_scene(scene);
        printf("Working...\n");

        auto begin = steady_clock::now();
        for(int j = 0; j < 100'000; ++j) {
                normalize(scene);
        }
        auto end = steady_clock::now();
        auto duration = duration_cast<milliseconds>(end - begin);

        printf("Time %lld ms\n", static_cast<long long>(duration.count()));
        print_scene(scene);

        return 0;
}

SoA layout

Time 4982 ms

Memory usage summary: heap total: 713728, heap peak: 713728, stack peak: 2992
         total calls   total memory   failed calls
 malloc|          6         713728              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          4         640000

#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
#include <xmmintrin.h>

/* -----------------------------------------------------------------------------
        SoA layout [X,X,X,X,...], [y,y,y,y,...], [Z,Z,Z,Z,...], ...
----------------------------------------------------------------------------- */

struct SoA_scene {
        size_t size;
        float *xs;
        float *ys;
        float *zs;
        float *ws;
};

void print_scene(SoA_scene const &scene)
{
        // Limit to 8 lines
        for(size_t j = 0lu; j < std::min(scene.size, 8lu); ++j) {
                printf("%10.3e ", scene.xs[j]);
                printf("%10.3e ", scene.ys[j]);
                printf("%10.3e ", scene.zs[j]);
                printf("%10.3e ", scene.ws[j]);
                printf("\n");
        }
        if(scene.size > 8lu) {
                printf("(%zu more)...\n", scene.size - 8lu);
        }
        printf("\n");
}

void normalize(SoA_scene &scene)
{
        // Euclidean norm, SIMD 4 x 4D-vectors at a time.
        for(size_t i = 0lu; i < scene.size; i += 4lu) {
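                // Unaligned loads: std::vector<float> does not guarantee the
                // 16-byte alignment that _mm_load_ps requires.
                __m128 xs = _mm_loadu_ps(&scene.xs[i]);
                __m128 ys = _mm_loadu_ps(&scene.ys[i]);
                __m128 zs = _mm_loadu_ps(&scene.zs[i]);
                __m128 ws = _mm_loadu_ps(&scene.ws[i]);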

                __m128 xxs = _mm_mul_ps(xs, xs);
                __m128 yys = _mm_mul_ps(ys, ys);
                __m128 zzs = _mm_mul_ps(zs, zs);
                __m128 wws = _mm_mul_ps(ws, ws);

                __m128 xx_yys = _mm_add_ps(xxs, yys);
                __m128 zz_wws = _mm_add_ps(zzs, wws);

                __m128 xx_yy_zz_wws = _mm_add_ps(xx_yys, zz_wws);

                __m128 norms = _mm_sqrt_ps(xx_yy_zz_wws);

                __m128 normed_xs = _mm_div_ps(xs, norms);
                __m128 normed_ys = _mm_div_ps(ys, norms);
                __m128 normed_zs = _mm_div_ps(zs, norms);
                __m128 normed_ws = _mm_div_ps(ws, norms);

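                // Unaligned stores, for the same reason _mm_store_ps is unsafe
                // on std::vector<float> storage.
                _mm_storeu_ps(&scene.xs[i], normed_xs);
                _mm_storeu_ps(&scene.ys[i], normed_ys);
                _mm_storeu_ps(&scene.zs[i], normed_zs);
                _mm_storeu_ps(&scene.ws[i], normed_ws);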
        }
}

float randf()
{
        // Seed once and reuse the engine; constructing a std::random_device
        // on every call is slow.
        static std::default_random_engine random_engine{std::random_device{}()};
        static std::uniform_real_distribution<float> distribution(-10.0f, 10.0f);
        return distribution(random_engine);
}

int main()
{
        // Scene description, e.g. cameras, or particles, or boids etc.
        // Has to be a multiple of 4!   -- No edge case handling.
        auto scene_size = 40'000lu;
        std::vector<float> xs(scene_size);
        std::vector<float> ys(scene_size);
        std::vector<float> zs(scene_size);
        std::vector<float> ws(scene_size);

        for(size_t i = 0lu; i < scene_size; ++i) {
                xs[i] = randf();
                ys[i] = randf();
                zs[i] = randf();
                ws[i] = randf();
        }

        SoA_scene scene{
                scene_size,
                std::data(xs),
                std::data(ys),
                std::data(zs),
                std::data(ws)
        };
        // Print, normalize 100'000 times, print again

        // Compiler is hopefully not smart enough to realize
        // idempotence of normalization
        using std::chrono::steady_clock;
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        // >:(

        print_scene(scene);
        printf("Working...\n");

        auto begin = steady_clock::now();
        for(int j = 0; j < 100'000; ++j) {
                normalize(scene);
        }
        auto end = steady_clock::now();
        auto duration = duration_cast<milliseconds>(end - begin);

        printf("Time %lld ms\n", static_cast<long long>(duration.count()));
        print_scene(scene);

        return 0;
}
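The benchmarks skip edge-case handling and require the scene size to be a multiple of 4. One common way to lift that restriction is a scalar tail loop after the vectorized part. The following is a sketch of that idea for the SoA layout, not part of the measured code; normalize_any_size is a hypothetical name.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>

// Normalizes n 4D vectors stored as four separate arrays, for any n.
inline void normalize_any_size(float *xs, float *ys, float *zs, float *ws,
                               std::size_t n) {
    assert(xs && ys && zs && ws);
    const std::size_t vec_end = n - n % 4;  // largest multiple of 4 <= n
    std::size_t i = 0;
    // Vectorized part: four vectors per iteration.
    for (; i < vec_end; i += 4) {
        const __m128 x = _mm_loadu_ps(xs + i);
        const __m128 y = _mm_loadu_ps(ys + i);
        const __m128 z = _mm_loadu_ps(zs + i);
        const __m128 w = _mm_loadu_ps(ws + i);
        const __m128 sum = _mm_add_ps(
            _mm_add_ps(_mm_mul_ps(x, x), _mm_mul_ps(y, y)),
            _mm_add_ps(_mm_mul_ps(z, z), _mm_mul_ps(w, w)));
        const __m128 norm = _mm_sqrt_ps(sum);
        _mm_storeu_ps(xs + i, _mm_div_ps(x, norm));
        _mm_storeu_ps(ys + i, _mm_div_ps(y, norm));
        _mm_storeu_ps(zs + i, _mm_div_ps(z, norm));
        _mm_storeu_ps(ws + i, _mm_div_ps(w, norm));
    }
    // Scalar tail: the 0 to 3 vectors that do not fill a whole register.
    for (; i < n; ++i) {
        const float norm = std::sqrt(xs[i] * xs[i] + ys[i] * ys[i] +
                                     zs[i] * zs[i] + ws[i] * ws[i]);
        xs[i] /= norm;
        ys[i] /= norm;
        zs[i] /= norm;
        ws[i] /= norm;
    }
}
```

Padding the arrays up to a multiple of 4 with dummy vectors is the other usual option, and it keeps the loop free of a tail at the cost of a few wasted elements.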

AoS layout

Since SSE4.1 there seems to be a third option, by far the simplest and fastest one.

Time 3074 ms

Memory usage summary: heap total: 746552, heap peak: 713736, stack peak: 2720
         total calls   total memory   failed calls
 malloc|          5         746552              0
realloc|          0              0              0  (nomove:0, dec:0, free:0)
 calloc|          0              0              0
   free|          2         672816
Histogram for block sizes:
    0-15              1  20% =========================
 1024-1039            1  20% =========================
32816-32831           1  20% =========================
   large              2  40% ==================================================


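#include <algorithm>
#include <chrono>
#include <cstdio>
#include <random>
#include <vector>
#include <smmintrin.h> // SSE4.1: _mm_dp_ps (build with -msse4.1 on gcc/clang)

/* -----------------------------------------------------------------------------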
        AoS layout [{X,y,Z,w},{X,y,Z,w},{X,y,Z,w},{X,y,Z,w},...]
----------------------------------------------------------------------------- */

using AoS_scene = std::vector<__m128>;

void print_scene(AoS_scene const &scene)
{
        // This is likely undefined behavior. Data might need to be stored
        // differently, but this is simpler to index.
        auto &&punned_data = reinterpret_cast<float const *>(scene.data());
        auto scene_size = std::size(scene);

        // Limit to 8 lines
        for(size_t j = 0lu; j < std::min(scene_size, 8lu); ++j) {
                for(size_t i = 0lu; i < 4lu; ++i) {
                        printf("%10.3e ", punned_data[j * 4lu + i]);
                }
                printf("\n");
        }
        if(scene_size > 8lu) {
                printf("(%zu more)...\n", scene_size - 8lu);
        }
        printf("\n");
}

void normalize(AoS_scene &scene)
{
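        // Euclidean norm, one 4D vector (one __m128) per iteration.
        // Each element holds a whole vector in this layout, so the loop must
        // advance by 1; a step of 4 would normalize only every fourth vector.
        for(size_t i = 0lu; i < scene.size(); ++i) {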
                __m128 vec = scene[i];
                __m128 dot = _mm_dp_ps(vec, vec, 255);
                __m128 norms = _mm_sqrt_ps(dot);
                scene[i] = _mm_div_ps(vec, norms);
        }
}

float randf()
{
        // Seed once and reuse the engine; constructing a std::random_device
        // on every call is slow.
        static std::default_random_engine random_engine{std::random_device{}()};
        static std::uniform_real_distribution<float> distribution(-10.0f, 10.0f);
        return distribution(random_engine);
}

int main()
{
        // Scene description, e.g. cameras, or particles, or boids etc.
        std::vector<__m128> scene(40'000);

        for(size_t i = 0lu; i < std::size(scene); ++i) {
                scene[i] = _mm_set_ps(randf(), randf(), randf(), randf());
        }

        // Print, normalize 100'000 times, print again

        // Compiler is hopefully not smart enough to realize
        // idempotence of normalization
        using std::chrono::steady_clock;
        using std::chrono::duration_cast;
        using std::chrono::milliseconds;
        // >:(

        print_scene(scene);
        printf("Working...\n");

        auto begin = steady_clock::now();
        for(int j = 0; j < 100'000; ++j) {
                normalize(scene);
        }
        auto end = steady_clock::now();
        auto duration = duration_cast<milliseconds>(end - begin);

        printf("Time %lld ms\n", static_cast<long long>(duration.count()));
        print_scene(scene);

        return 0;
}
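A quick way to sanity-check a run like this is to measure how far the stored vectors are from unit length once the timed loop has finished. The helper below is a hypothetical sketch for the AoS layout viewed as plain floats; it is not part of the benchmark.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Largest deviation from unit length over n vectors stored as
// [x,y,z,w, x,y,z,w, ...].
inline float max_norm_error(const float *data, std::size_t n) {
    assert(data != nullptr || n == 0);
    float worst = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float *v = data + 4 * i;
        const float norm = std::sqrt(v[0] * v[0] + v[1] * v[1] +
                                     v[2] * v[2] + v[3] * v[3]);
        worst = std::max(worst, std::fabs(norm - 1.0f));
    }
    return worst;
}
```

Called as max_norm_error(reinterpret_cast<const float *>(scene.data()), scene.size()), with the same type-punning caveat noted in print_scene, a result well above rounding error means that some vectors were never normalized.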

4 answers

Answer 1
Score 4, accepted, 2019-02-04 13:56:40

Answer 2
Score 3, 2019-02-05 00:10:00

Answer 3
Score 1, 2019-02-04 14:37:51

Answer 4
Score 1, 2019-02-04 18:47:41
