memcpy: GCC or implementation optimizations?

While writing my own memcpy function for a custom bootloader and kernel, I decided to look into what makes a good, and possibly fast, implementation for copying memory on aligned boundaries (e.g., scrolling in video mode, where each line on the screen starts on an aligned boundary), but also for large (> 1 MB) and unaligned structures.

My question is: since the compiler (GCC in my case) supports a variety of optimization options, either enabled individually or through -O2, -O3, and so on, how far do I need to optimize the memcpy function by hand to get the best copy performance in combination with the GCC optimization flags?

My current implementation is the following:

static void *memcpy_unaligned(void *dst, const void *src, size_t len)
{
    size_t i;
    unsigned char *d = (unsigned char *)dst;
    unsigned char *s = (unsigned char *)src;

    for (i = 0; i < len; i++)
        d[i] = s[i];

    return dst; 
}

static void *memcpy_aligned16(void *dst, const void *src, size_t len)
{
    size_t i;
    uint16_t *d = (uint16_t *)dst;
    uint16_t *s = (uint16_t *)src;

    for (i = 0; i < ((len) & (~1)); i += 2)
        d[i] = s[i];

    for ( ; i < len; i++)
        ((unsigned char *)d)[i] = ((unsigned char *)s)[i];

    return dst;
}

static void *memcpy_aligned32(void *dst, const void *src, size_t len)
{
    size_t i;
    uint32_t *d = (uint32_t *)dst;
    uint32_t *s = (uint32_t *)src;

    for (i = 0; i < ((len) & (~3)); i += 4)
        d[i] = s[i];

    for ( ; i < len; i++)
        ((unsigned char *)d)[i] = ((unsigned char *)s)[i];

    return dst;
}

static void *memcpy_aligned(void *dst, const void *src, size_t len)
{
    /* Are dst and src aligned on a 4-byte boundary? */
    if (ALIGNED(dst, src, 4))
        return memcpy_aligned32(dst, src, len);

    /* Are dst and src aligned on a 2-byte boundary? */
    if (ALIGNED(dst, src, 2))
        return memcpy_aligned16(dst, src, len);

    return memcpy_unaligned(dst, src, len);
}

void* memcpy(void *dst, const void *src, size_t len)
{
    return memcpy_aligned(dst, src, len);
}

Is it also useful to check whether the dst and src pointers are misaligned only in their first one to three bytes, so that a few single-byte copies can be done first, followed by word and dword copying?

Profile to settle such matters, but get the function right first.

OP's code has functional errors.

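  • OP's original code increases the index too fast: d and s are uint32_t pointers, so d[i] already advances 4 bytes per step, yet i is incremented by 4 and compared against a byte count. The word loop therefore skips three of every four words and reads and writes past the end of the buffers. The uint16_t version has the same problem with a step of 2.
  • Note: (len) & (~3) may mask incorrectly on rare non-two's-complement machines, because ~3 is evaluated as an int before it is converted to size_t. Writing ~(size_t)3, or using len/4 as the loop bound, avoids this.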
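To make the first point concrete, here is a small host-side sketch (not kernel code) that computes the highest byte offset each loop form touches. The helper names buggy_last_byte, fixed_last_byte and round_down4 are made up for illustration and are not part of OP's code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: len rounded down to a multiple of 4.
   The mask is built in size_t, so it does not depend on how
   negative int values are represented. */
static size_t round_down4(size_t len)
{
    return len & ~(size_t)3;
}

/* Highest byte offset touched by OP's loop form: i is a word index,
   but it steps by 4 and is bounded by a byte count.
   Returns 0 if the loop body never runs. */
static size_t buggy_last_byte(size_t len)
{
    size_t last = 0;
    for (size_t i = 0; i < round_down4(len); i += 4)
        last = i * sizeof(uint32_t) + sizeof(uint32_t) - 1;
    return last;
}

/* Same measurement for a loop that counts whole words and steps by 1.
   Returns 0 if the loop body never runs. */
static size_t fixed_last_byte(size_t len)
{
    size_t last = 0;
    for (size_t i = 0; i < len / sizeof(uint32_t); i++)
        last = i * sizeof(uint32_t) + sizeof(uint32_t) - 1;
    return last;
}
```

For len = 16, the first form reaches byte offset 51 of a 16-byte buffer, while the second stops at offset 15.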

Use restrict to allow additional optimizations. Note: memcpy() is already undefined behavior (UB) when the buffers overlap, so restrict asks nothing new of callers.

static void *memcpy_aligned32(void * restrict dst, const void *restrict src, size_t len) {
  size_t i;
  // Casts not needed.  Do not cast away const-ness
  uint32_t *d = dst;
  const uint32_t *s = src;  

  size_t l4 = len/4;
  for (i = 0; i < l4; i++) {
    d[i] = s[i];
  }

  i *= 4;
  for ( ; i < len; i++) {
    ((unsigned char *)d)[i] = ((const unsigned char *)s)[i]; // I'd use `uint8_t*` for symmetry
  }  

  return dst;
}

Perhaps additional aliasing (AA) issues apply as well, since the buffers are accessed through uint32_t pointers.
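On the aliasing point and the question about leading misaligned bytes, here is one possible sketch, not a tuned implementation. It assumes GCC or Clang (the may_alias attribute is a compiler extension) and that converting a pointer to uintptr_t exposes its alignment in the low bits, which holds on common targets. The names u32_alias and copy_head_aligned are hypothetical. Whether the extra head handling pays off is something only profiling can tell.

```c
#include <stddef.h>
#include <stdint.h>

/* GCC/Clang extension: accesses through this type may alias any object. */
typedef uint32_t __attribute__((may_alias)) u32_alias;

/* Hypothetical name; a sketch, not a drop-in memcpy replacement. */
static void *copy_head_aligned(void *restrict dst, const void *restrict src,
                               size_t len)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Word copies are only possible if both pointers share the same
       offset within a 4-byte word; otherwise everything goes bytewise. */
    if ((((uintptr_t)d ^ (uintptr_t)s) & 3) == 0) {
        /* Head: byte copies until d (and therefore s) is 4-byte aligned. */
        while (len > 0 && ((uintptr_t)d & 3) != 0) {
            *d++ = *s++;
            len--;
        }

        /* Body: whole 32-bit words; the index counts words, not bytes. */
        u32_alias *dw = (u32_alias *)d;
        const u32_alias *sw = (const u32_alias *)s;
        size_t words = len / 4;
        for (size_t i = 0; i < words; i++)
            dw[i] = sw[i];

        d += words * 4;
        s += words * 4;
        len -= words * 4;
    }

    /* Tail, or the whole buffer when the offsets differ: bytes. */
    while (len > 0) {
        *d++ = *s++;
        len--;
    }

    return dst;
}
```

A call such as copy_head_aligned(dst + 1, src + 1, n) copies up to three bytes first, then words, then the remaining bytes. When dst and src sit at different offsets within a word, the function degrades to the plain byte loop.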
