
Cache utilization in matrix transpose in C

This code transposes a matrix in four ways. The first does sequential writes and non-sequential reads; the second is the opposite. The next two repeat the first two, but with writes that skip the cache. What seems to happen is that sequential writes are faster, and skipping the cache is faster. What I don't understand is: if the cache is being skipped, why are sequential writes still faster?

QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 1:\t%lld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
     tmp[i][j] = mul2[j][i];
QueryPerformanceCounter(&after);
printf("Transpose 2:\t%lld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (i = 0; i < N; ++i)
   for (j = 0; j < N; ++j)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 3:\t%lld\n", after.QuadPart - before.QuadPart);

QueryPerformanceCounter(&before);
for (j = 0; j < N; ++j)
   for (i = 0; i < N; ++i)
      _mm_stream_si32(&tmp[i][j], mul2[j][i]);
QueryPerformanceCounter(&after);
printf("Transpose 4:\t%lld\n", after.QuadPart - before.QuadPart);

EDIT: The output is

Transpose 1:    47603
Transpose 2:    92449
Transpose 3:    38340
Transpose 4:    69597

The CPU has a write-combining buffer that combines writes to the same cache line into a single burst. In this case (cache being skipped for sequential writes), the write-combining buffer acts as a one-line cache, which makes the results very similar to those with the cache not being skipped.

To be exact, even when the cache is skipped, writes still happen to memory in bursts.

See write-combining logic behavior here.

You could try a non-linear memory layout for the matrix to improve cache utilization. With 4x4 tiles of 32-bit floats, the transpose can be done with only a single access to each cache line. As a bonus, the tile transposes themselves can be done easily with _MM_TRANSPOSE4_PS.

Transposing a very large matrix is still a very memory-intensive operation. It will still be heavily bandwidth-limited, but at least cache-line utilization is near optimal. I don't know whether the performance could be optimized further. My testing shows that a few-years-old laptop manages to transpose 16k*16k (1 GiB of data) in about 300 ms; since an in-place transpose reads and writes every element, that is roughly 2 GiB of traffic, or about 7 GB/s.

I also tried to use _mm_stream_sd, but for some reason it actually made performance worse. I don't understand non-temporal memory writes well enough to make a practical guess as to why performance drops with _mm_stream_ps. A possible reason is of course that the cache line is already in the L1 cache, ready for the write operation.

But the really important part of the non-linear matrix layout is the possibility of avoiding the transpose completely and simply running the multiplication in tile-friendly order. I only have transpose code, though, which I'm using to improve my knowledge about cache management in algorithms.

I haven't yet tried to test whether prefetching would improve memory bandwidth usage. The current code runs at about 0.5 instructions per cycle (good cache-friendly code runs at around 2 instructions per cycle on this CPU), which leaves a lot of free cycles for prefetch instructions, allowing even quite complex runtime calculations to optimize prefetch timing.

Example code from my transpose benchmark follows.

#define MATSIZE 16384
#define align(val, a) (((val) + (a) - 1) / (a) * (a)) /* round val up to a multiple of a */
#define tilewidth 4
typedef float matrix[align(MATSIZE, tilewidth) * MATSIZE] __attribute__((aligned(64)));


float &index(matrix m, unsigned i, unsigned j)
{
    /* tiled address calculation */
    /* a single cache line holds one 4x4 sub matrix (64 bytes = 4*4*sizeof(float)) */
    /* tiles are arranged linearly from top to bottom */
    /*
     * eg: 16x16 matrix tile positions:
     *   t1 t5 t9  t13
     *   t2 t6 t10 t14
     *   t3 t7 t11 t15
     *   t4 t8 t12 t16
     */
    const unsigned tilestride = tilewidth * MATSIZE;
    const unsigned comp0 = i % tilewidth; /* i inside tile is least significant part */
    const unsigned comp1 = j * tilewidth; /* next part is j multiplied by tile width */
    const unsigned comp2 = i / tilewidth * tilestride;
    const unsigned add = comp0 + comp1 + comp2;
    return m[add];
}

/* Get start of tile reference */
float &tile(matrix m, unsigned i, unsigned j)
{
    const unsigned tilestride = tilewidth * MATSIZE;
    const unsigned comp1 = j * tilewidth; /* next part is j multiplied by tile width */
    const unsigned comp2 = i / tilewidth * tilestride;
    return m[comp1 + comp2];

}
template<bool diagonal>
static void doswap(matrix mat, unsigned i, unsigned j)
{
    /* special path to swap a whole tile at once */
    union {
        float *fs;
        __m128 *mm;
    } src, dst;
    src.fs = &tile(mat, i, j);
    dst.fs = &tile(mat, j, i);
    if (!diagonal) {
        __m128 srcrow0 = src.mm[0];
        __m128 srcrow1 = src.mm[1];
        __m128 srcrow2 = src.mm[2];
        __m128 srcrow3 = src.mm[3];
        __m128 dstrow0 = dst.mm[0];
        __m128 dstrow1 = dst.mm[1];
        __m128 dstrow2 = dst.mm[2];
        __m128 dstrow3 = dst.mm[3];

        _MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);
        _MM_TRANSPOSE4_PS(dstrow0, dstrow1, dstrow2, dstrow3);

#if STREAMWRITE == 1
        _mm_stream_ps(src.fs +  0, dstrow0);
        _mm_stream_ps(src.fs +  4, dstrow1);
        _mm_stream_ps(src.fs +  8, dstrow2);
        _mm_stream_ps(src.fs + 12, dstrow3);
        _mm_stream_ps(dst.fs +  0, srcrow0);
        _mm_stream_ps(dst.fs +  4, srcrow1);
        _mm_stream_ps(dst.fs +  8, srcrow2);
        _mm_stream_ps(dst.fs + 12, srcrow3);
#else
        src.mm[0] = dstrow0;
        src.mm[1] = dstrow1;
        src.mm[2] = dstrow2;
        src.mm[3] = dstrow3;
        dst.mm[0] = srcrow0;
        dst.mm[1] = srcrow1;
        dst.mm[2] = srcrow2;
        dst.mm[3] = srcrow3;
#endif
    } else {
        __m128 srcrow0 = src.mm[0];
        __m128 srcrow1 = src.mm[1];
        __m128 srcrow2 = src.mm[2];
        __m128 srcrow3 = src.mm[3];

        _MM_TRANSPOSE4_PS(srcrow0, srcrow1, srcrow2, srcrow3);

#if STREAMWRITE == 1
        _mm_stream_ps(src.fs +  0, srcrow0);
        _mm_stream_ps(src.fs +  4, srcrow1);
        _mm_stream_ps(src.fs +  8, srcrow2);
        _mm_stream_ps(src.fs + 12, srcrow3);
#else
        src.mm[0] = srcrow0;
        src.mm[1] = srcrow1;
        src.mm[2] = srcrow2;
        src.mm[3] = srcrow3;
#endif
    }
}

static void transpose(matrix mat)
{
    const unsigned xstep = 256;
    const unsigned ystep = 256;
    const unsigned istep = 4;
    const unsigned jstep = 4;
    unsigned x1, y1, i, j;
    /* the y limit is checked against x1 + 1 so the unrolled inner loop
     * can access entries close to the diagonal axis
     */
    for (x1 = 0; x1 < MATSIZE - xstep + 1 && MATSIZE > xstep && xstep; x1 += xstep)
        for (y1 = 0; y1 < std::min(MATSIZE - ystep + 1, x1 + 1); y1 += ystep)
            for ( i = x1 ; i < x1 + xstep; i += istep ) {
                for ( j = y1 ; j < std::min(y1 + ystep, i); j+= jstep )
                {
                    doswap<false>(mat, i, j);
                }
                if (i == j && j < (y1 + ystep))
                    doswap<true>(mat, i, j);
            }

    for ( i = 0 ; i < x1; i += istep ) {
        for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j+= jstep )
        {
            doswap<false>(mat, i, j);
        }
        if (i == j)
            doswap<true>(mat, i, j);
    }
    for ( i = x1 ; i < MATSIZE - istep + 1; i += istep ) {
        for ( j = y1 ; j < std::min(MATSIZE - jstep + 1, i); j+= jstep )
        {
            doswap<false>(mat, i, j);
        }
        if (i == j)
            doswap<true>(mat, i, j);
    }
    x1 = MATSIZE - MATSIZE % istep;
    y1 = MATSIZE - MATSIZE % jstep;

    for ( i = x1 ; i < MATSIZE; i++ )
        for ( j = 0 ; j < std::min((unsigned)MATSIZE, i); j++ )
                    std::swap(index(mat, i, j+0), index(mat, j+0, i));

    for ( i = 0; i < x1; i++ )
        for ( j = y1 ; j < std::min((unsigned)MATSIZE, i) ; j++ )
                    std::swap(index(mat, i, j+0), index(mat, j+0, i));
}
