
Memset struct variables separately vs memset entire struct, which is faster?

Say I have a structure like this:

struct tmp {
    unsigned char arr1[10];
    unsigned char arr2[10];
    int  i1;
    int  i2;
    unsigned char arr3[10];
    unsigned char arr4[10];
};

Which of these would be faster?

(1) Memset the entire struct to 0 and then fill members as:

struct tmp t1;
memset(&t1, 0, sizeof(struct tmp));

t1.i1 = 10;
t1.i2 = 20;
memcpy(t1.arr1, "ab", sizeof("ab"));
// arr2, arr3 and arr4 will be filled later.

OR

(2) Memset separate variables:

struct tmp t1;
t1.i1 = 10;
t1.i2 = 20;
memcpy(t1.arr1, "ab", sizeof("ab"));

memset(t1.arr2, 0, sizeof(t1.arr2)); // will be filled later
memset(t1.arr3, 0, sizeof(t1.arr3)); // will be filled later
memset(t1.arr4, 0, sizeof(t1.arr4)); // will be filled later

Just in terms of performance, are multiple calls to memset (on separate members of a structure) faster or slower than a single call to memset (on the entire structure)?

It isn't really meaningful to discuss this without a specific system in mind, nor is it fruitful to ponder these things unless you actually have a performance bottleneck. I can give it a try still.

For a "general computer", you would have to consider:对于“通用计算机”,您必须考虑:

  • Aligned access
    Accessing a chunk of data in one go is usually better. In case of potential misalignment, the overhead code to deal with that is roughly the same no matter how large the data is. Assuming theoretically that all access in this code happens to be misaligned, then 1 memset call is better than 3.

    Also, we can assume that the first member of a struct is aligned, but we cannot assume that for any individual member inside the struct. The struct itself will be placed at an aligned address, and the compiler may add padding anywhere inside it to keep the members aligned.

    Your struct has been declared without any consideration of alignment, so this could be an issue here - the compiler may insert padding. (The actual member offsets on a given ABI can be checked with offsetof; see the sketch after this list.)

    (On the other hand, a memset on the whole struct will also overwrite padding bytes, which is a tiny bit of extra work.)

  • Data cache use
    Accessing an area of adjacent memory from top to bottom is much more "cache-friendly" than accessing fragments of it from multiple places in your code. Subsequent access of contiguous memory means that the computer can load a lot of data into cache, instead of fetching it from RAM, which is slower.

  • Instruction cache use and branch prediction
    Not very relevant in this case, since the code is basically just doing raw copies and doing so branch-free.

  • The amount of machine instructions generated
    This is always a good, rough indication of how fast the code is. Obviously some instructions are a lot slower than others, but fewer instructions very often means faster code. Disassembling your two functions with gcc x86_64 -O3, I get this (a possible C reconstruction of the two compiled functions is sketched after this list):

     func1:
             movabs  rax, 85899345930
             pxor    xmm0, xmm0
             movups  XMMWORD PTR [rdi+16], xmm0
             mov     QWORD PTR [rdi+20], rax
             mov     eax, 25185
             movups  XMMWORD PTR [rdi], xmm0
             movups  XMMWORD PTR [rdi+32], xmm0
             mov     WORD PTR [rdi], ax
             ret
     func2:
             movabs  rax, 85899345930
             xor     edx, edx
             xor     ecx, ecx
             xor     esi, esi
             mov     QWORD PTR [rdi+20], rax
             mov     eax, 25185
             mov     WORD PTR [rdi], ax
             mov     BYTE PTR [rdi+2], 0
             mov     QWORD PTR [rdi+10], 0
             mov     WORD PTR [rdi+18], dx
             mov     QWORD PTR [rdi+28], 0
             mov     WORD PTR [rdi+36], cx
             mov     QWORD PTR [rdi+38], 0
             mov     WORD PTR [rdi+46], si
             ret

    This is a pretty good indication that the former code is more efficient, and it should also be more data cache-friendly, so it would surprise me if (1) isn't significantly faster.
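
For reference, here is what the two disassembled functions are assumed to look like in C - a sketch only; the names func1/func2 and the pointer parameter are assumptions that match approaches (1) and (2) applied to a struct passed by pointer:

#include <string.h>

/* Sketch: func1 corresponds to approach (1), func2 to approach (2). */
void func1(struct tmp *t)
{
    memset(t, 0, sizeof(struct tmp));    /* zero the whole struct in one call */
    t->i1 = 10;
    t->i2 = 20;
    memcpy(t->arr1, "ab", sizeof("ab"));
}

void func2(struct tmp *t)
{
    t->i1 = 10;
    t->i2 = 20;
    memcpy(t->arr1, "ab", sizeof("ab"));
    memset(t->arr2, 0, sizeof(t->arr2)); /* zero each member separately */
    memset(t->arr3, 0, sizeof(t->arr3));
    memset(t->arr4, 0, sizeof(t->arr4));
}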
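
Regarding the "Aligned access" point above: to see the layout the compiler actually chose on a given ABI, one can print the member offsets and the total size. A sketch, assuming the struct definition from the question is in scope; the output depends on compiler and target:

#include <stdio.h>
#include <stddef.h>

int main(void)
{
    /* Print where each member lands and how big the struct really is. */
    printf("arr1=%zu arr2=%zu i1=%zu i2=%zu arr3=%zu arr4=%zu size=%zu\n",
           offsetof(struct tmp, arr1), offsetof(struct tmp, arr2),
           offsetof(struct tmp, i1),   offsetof(struct tmp, i2),
           offsetof(struct tmp, arr3), offsetof(struct tmp, arr4),
           sizeof(struct tmp));
    return 0;
}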

Also note that if you declared this struct with static storage duration, you would "outsource" the zeroing to the CRT part of the program that sets up .bss and runs before main() is even called. Then none of these memset calls would be needed. At the expense of slightly slower start-up, but a faster program overall.
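
A minimal sketch of that alternative, assuming the same struct tmp (the helper name init_tmp is hypothetical) - with static storage duration every member starts out as zero, so only the non-zero fields need to be written:

#include <string.h>

static struct tmp t1;   /* zero-initialized before main() runs, no memset needed */

void init_tmp(void)     /* hypothetical helper */
{
    t1.i1 = 10;
    t1.i2 = 20;
    memcpy(t1.arr1, "ab", sizeof("ab"));
}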
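
Finally, since the real numbers depend on the specific system, the only reliable way to answer "which is faster" is to time both variants on the target. A rough timing sketch, assuming POSIX clock_gettime, the func1/func2 sketches above, and an arbitrary iteration count; a real measurement should also repeat runs and guard against the optimizer removing the work:

#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000L

static double time_one(void (*f)(struct tmp *))
{
    struct tmp t;
    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < ITERATIONS; i++)
        f(&t);   /* indirect call discourages the optimizer from deleting the loop */
    clock_gettime(CLOCK_MONOTONIC, &end);
    return (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
}

int main(void)
{
    printf("func1: %.3f s\n", time_one(func1));
    printf("func2: %.3f s\n", time_one(func2));
    return 0;
}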
