malloc once, then distribute memory over struct arrays
I have a struct that has the following memory layout:
uint32_t
variable length array of type uint16_t
variable length array of type uint16_t
Because the arrays have variable length, I hold pointers to them, effectively:
struct struct1 {
    uint32_t n;
    uint16_t *array1;
    uint16_t *array2;
};
typedef struct struct1 struct1;
Now, when allocating these structs, I see two options:
A) malloc the struct itself, then malloc space for the arrays individually and set the pointers in the struct to point to the correct memory locations:
uint32_t n1 = 10;
uint32_t n2 = 20;
struct1 *s1 = malloc(sizeof(struct1));
uint16_t *array1 = malloc(sizeof(uint16_t) * n1);
uint16_t *array2 = malloc(sizeof(uint16_t) * n2);
s1->n = n1;
s1->array1 = array1;
s1->array2 = array2;
B) malloc memory for everything combined, then "distribute" the memory over the struct:
struct1 *s1 = malloc(sizeof(struct1) + (n1 + n2) * sizeof(uint16_t));
s1->n = n1;
s1->array1 = (uint16_t *)(s1 + 1); /* first byte past the struct */
s1->array2 = s1->array1 + n1;      /* pointer arithmetic counts elements, not bytes */
Note that array1 and array2 are no bigger than a few KB each, and usually not many struct1s are needed. However, cache efficiency is a concern, since this struct is used for numeric data crunching.
Note that right now I'm using gcc (C89?) on Linux, but I could use C99/C11 if necessary. Thanks in advance.
EDIT: To clarify further: the size of the arrays will never change after creation. Multiple struct1s will not always be allocated at once, but rather occasionally during the program's runtime.
I think your option A is much cleaner and would scale in a more sensible way. Imagine having to realloc space when the array in one of the structures becomes larger: in option A, you can realloc that memory since it isn't logically attached to anything else. In option B, you need to add additional logic to ensure you don't break your other array.
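To make the realloc point concrete, here is a minimal sketch of growing array1 under option A. The helper name struct1_grow_array1 is hypothetical, and it assumes (as in the question's code, which stores n1 in n) that n tracks the length of array1.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct struct1 {
    uint32_t n;
    uint16_t *array1;
    uint16_t *array2;
};
typedef struct struct1 struct1;

/* Hypothetical helper: resize array1 to new_len elements (option A layout).
 * array2 lives in its own block, so it is not touched at all.
 * Returns 0 on success, -1 if realloc fails (the old array1 stays valid). */
int struct1_grow_array1(struct1 *s, size_t new_len)
{
    uint16_t *tmp;

    assert(s != NULL && new_len > 0);
    tmp = realloc(s->array1, new_len * sizeof *tmp);
    if (tmp == NULL)
        return -1;
    s->array1 = tmp;
    s->n = (uint32_t)new_len;   /* assumption: n is the length of array1 */
    return 0;
}
```

Under option B the same growth would mean reallocating the combined block, moving the contents of array2, and recomputing both pointers.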
I also think (even in C89, but I could be wrong) that there is nothing wrong with this:
struct1 *s1 = malloc(sizeof(struct1));
s1->array1 = malloc(sizeof(uint16_t) * n1);
s1->array2 = malloc(sizeof(uint16_t) * n2);
s1->n = n1;
The above takes out the middle-man arrays. I think it is cleaner because you immediately see that you are allocating space for a pointer in a structure.
I have used your option B before for 2D arrays, where I allocate a single block and use indexing rules in my code to treat it as a 2D space. This is useful when I want a rectangular 2D space, so that when I grow it, I always grow every row or every column. In other words, I never want heterogeneous array sizes.
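As an illustration of the single-block 2D case, here is a minimal sketch. The names struct grid, grid_init and grid_at are hypothetical, the storage is assumed to be row-major, and overflow of rows * cols is not checked.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical rectangular grid stored in one block, row-major. */
struct grid {
    size_t rows;
    size_t cols;
    int *cells;   /* rows * cols elements in a single allocation */
};

/* Allocate a zero-filled rows x cols grid. Returns 0 on success, -1 on failure. */
int grid_init(struct grid *g, size_t rows, size_t cols)
{
    assert(g != NULL && rows > 0 && cols > 0);
    g->cells = calloc(rows * cols, sizeof *g->cells);
    if (g->cells == NULL)
        return -1;
    g->rows = rows;
    g->cols = cols;
    return 0;
}

/* The "logical rule": element (r, c) lives at index r * cols + c. */
int *grid_at(struct grid *g, size_t r, size_t c)
{
    assert(r < g->rows && c < g->cols);
    return &g->cells[r * g->cols + c];
}
```

Because every row has the same length, one multiplication and one addition locate any element, which is what makes the single allocation convenient here and awkward for arrays of differing sizes.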
Because you clarified that your structures/arrays will never need to be reallocated, I think option B is less bad. It still seems to be a worse solution for this application than option A, and here are my reasons for thinking this:
malloc is optimized such that allocating a single block gains little over allocating the blocks individually. So, if you comment the code thoroughly, and your application absolutely requires you to optimize everything you possibly can at the expense of clean and logically sensible code (where memory space and data structures are logically separated in a similar way), and you know that this optimization is better than what a good compiler (like Clang) can do, then option B could be a better option.
In the spirit of self-criticism, I wanted to see if I could measure the difference. So I wrote two programs (one for option A and one for option B) and compiled them with optimizations off. I used a FreeBSD virtual machine to get as clean an environment as possible, and I used gcc.
Here are the programs that I used to test the two methods:
optionA.c:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NSIZE 100000
#define NTESTS 10000000

struct test_struct {
    int n;
    int *array1;
    int *array2;
};

void freeA(struct test_struct *input) {
    free(input->array1);
    free(input->array2);
    free(input);
    return;
}

void optionA() {
    struct test_struct *s1 = malloc(sizeof(*s1));
    s1->array1 = malloc(sizeof(*(s1->array1)) * NSIZE);
    s1->array2 = malloc(sizeof(*(s1->array1)) * NSIZE);
    s1->n = NSIZE;
    freeA(s1);
    s1 = 0;
    return;
}

int main() {
    clock_t beginA = clock();
    int i;
    for (i=0; i<NTESTS; i++) {
        optionA();
    }
    clock_t endA = clock();
    int time_spent_A = (endA - beginA);
    printf("Time spent for option A: %d\n", time_spent_A);
    return 0;
}
optionB.c:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>

#define NSIZE 100000
#define NTESTS 10000000

struct test_struct {
    int n;
    int *array1;
    int *array2;
};

void freeB(struct test_struct *input) {
    free(input);
    return;
}

void optionB() {
    struct test_struct *s1 = malloc(sizeof(*s1) + 2*NSIZE*sizeof(*(s1->array1)));
    /* s1 + 1 is the first byte past the struct. Pointer arithmetic on s1
       counts whole structs, so s1 + sizeof(*s1) would overshoot. */
    s1->array1 = (int *)(s1 + 1);
    s1->array2 = s1->array1 + NSIZE;
    s1->n = NSIZE;
    freeB(s1);
    s1 = 0;
    return;
}

int main() {
    clock_t beginB = clock();
    int i;
    for (i=0; i<NTESTS; i++) {
        optionB();
    }
    clock_t endB = clock();
    int time_spent_B = (endB - beginB);
    printf("Time spent for option B: %d\n", time_spent_B);
    return 0;
}
Results for these tests are given in clocks (see clock(3) for more information).
Series | Option A | Option B
------------------------------
1 | 332 | 158
------------------------------
2 | 334 | 155
------------------------------
3 | 334 | 156
------------------------------
4 | 333 | 154
------------------------------
5 | 339 | 156
------------------------------
6 | 334 | 155
------------------------------
avg | 334.3 | 155.7
------------------------------
Note that these speeds are still incredibly fast and translate to milliseconds over millions of tests. I have also found that Clang (cc) is better than gcc at optimizing. On my machine, even after writing a method that writes data to the arrays (to ensure they don't get optimized out of existence), I got no difference between the two methods when compiling with cc.
I would advise a hybrid of the two:
allocate the structs in one call (it is now an array of structs);
allocate the arrays in one call, and make sure the size includes any padding for the alignment required by your compiler/platform;
distribute the arrays over the structs, taking the alignment into account.
However, malloc is already optimized, so your first solution would still be preferred.
Note: as user Greg Schmit's solution points out, allocating all the arrays at once will cause difficulties if the array sizes need to change at run time.
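The three steps above could be sketched roughly as follows. The names struct1_pool_new and struct1_pool_free are hypothetical, and the sketch assumes every struct in the pool uses the same n1 and n2. Both arrays hold uint16_t, so no padding is needed between them; with mixed element types each offset would have to be rounded up to the alignment of the element that follows.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct struct1 {
    uint32_t n;
    uint16_t *array1;
    uint16_t *array2;
};

/* Hypothetical: one malloc for `count` structs, one malloc for all their
 * arrays, then hand each struct its slice of the shared data block. */
struct struct1 *struct1_pool_new(size_t count, uint32_t n1, uint32_t n2)
{
    struct struct1 *pool;
    uint16_t *data;
    size_t per = (size_t)n1 + n2;   /* elements owned by each struct */
    size_t i;

    assert(count > 0 && per > 0);
    pool = malloc(count * sizeof *pool);
    data = malloc(count * per * sizeof *data);
    if (pool == NULL || data == NULL) {
        free(pool);
        free(data);
        return NULL;
    }
    for (i = 0; i < count; i++) {
        pool[i].n = n1;
        pool[i].array1 = data + i * per;
        pool[i].array2 = pool[i].array1 + n1;
    }
    return pool;
}

void struct1_pool_free(struct struct1 *pool)
{
    if (pool != NULL) {
        free(pool[0].array1);   /* start of the shared data block */
        free(pool);
    }
}
```

This only fits when the structs can be created together; the question's edit says they are created occasionally, which is one more reason the individual allocations may remain the simpler choice.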
Because the two arrays have the same type, there are many more options than that, based on creative use of the C99 flexible array member. I'd recommend you use a pointer only for the second array,
struct foo {
    uint16_t *array2;
    uint32_t field;
    uint16_t array1[];
};
and allocate memory for both at the same time:
struct foo *foo_new(const size_t length1, const size_t length2)
{
    struct foo *result;
    result = malloc( sizeof (struct foo)
                   + length1 * sizeof (uint16_t)
                   + length2 * sizeof (uint16_t) );
    if (!result)
        return NULL;
    result->array2 = result->array1 + length1;
    return result;
}
Note that with struct foo *bar, accessing element i in the two arrays uses the same notation, bar->array1[i] and bar->array2[i], respectively.
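To show that identical notation in use, here is a small self-contained sketch. struct foo and foo_new are repeated from above so the block compiles on its own; foo_dot is a hypothetical consumer added only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct foo {
    uint16_t *array2;
    uint32_t  field;
    uint16_t  array1[];   /* C99 flexible array member */
};

/* Same allocation as shown above: one block holds the struct and both arrays. */
struct foo *foo_new(const size_t length1, const size_t length2)
{
    struct foo *result;

    result = malloc(sizeof (struct foo)
                    + length1 * sizeof (uint16_t)
                    + length2 * sizeof (uint16_t));
    if (!result)
        return NULL;
    result->array2 = result->array1 + length1;
    return result;
}

/* Hypothetical consumer: sum of element-wise products over the first
 * `count` entries. Both arrays are indexed with the same notation. */
uint32_t foo_dot(const struct foo *bar, size_t count)
{
    uint32_t sum = 0;
    size_t i;

    assert(bar != NULL);
    for (i = 0; i < count; i++)
        sum += (uint32_t)bar->array1[i] * bar->array2[i];
    return sum;
}
```

Since everything lives in one block, a single free(bar) releases the struct and both arrays.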
In the context of scientific computing, I would consider completely different options, depending on the access patterns. For example, if the two arrays are accessed in lockstep (in any direction), I would use
typedef uint16_t pair16[2];
struct bar {
    uint32_t field;
    pair16 array[];
};
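A minimal sketch of allocating and walking this interleaved layout follows. bar_new and bar_dot are hypothetical names, and the sketch assumes that field stores the number of pairs, which the original does not specify.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t pair16[2];

struct bar {
    uint32_t field;
    pair16   array[];   /* array[i][0] and array[i][1] sit side by side */
};

/* Hypothetical constructor: room for `length` pairs in one block.
 * Assumption: field is used to remember the number of pairs. */
struct bar *bar_new(size_t length)
{
    struct bar *result = malloc(sizeof *result + length * sizeof (pair16));

    if (result == NULL)
        return NULL;
    result->field = (uint32_t)length;
    return result;
}

/* Lockstep traversal: each iteration touches two adjacent uint16_t values,
 * so both operands normally arrive in the same cache line. */
uint32_t bar_dot(const struct bar *b)
{
    uint32_t sum = 0;
    uint32_t i;

    assert(b != NULL);
    for (i = 0; i < b->field; i++)
        sum += (uint32_t)b->array[i][0] * b->array[i][1];
    return sum;
}
```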
If the arrays were large, then copying them into temporary buffers (arrays of pair16 above, if accessed in lockstep) would possibly help, but with at most a few thousand entries, it is unlikely to give a significant speed boost.
In cases where the access pattern of one array depends on the other, but you still do enough computation on each entry, it may be useful to compute the address of the next entry early, and use the GCC built-in __builtin_prefetch() to tell the CPU you'll need it soon, before doing the computation on the current entry. It may reduce the data access latencies (although the access predictors are pretty darn good on current processors already).
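As a rough sketch of that idea, assume a hypothetical walk in which one array (next) holds the index of the entry to visit after the current one. The function name chase_sum and the data layout are illustrative only; the caller must guarantee that every index in next is in range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical data-dependent walk: next[i] is the index to visit after i.
 * The address of the following entry is known before the work on the
 * current entry starts, so it can be prefetched first. */
uint32_t chase_sum(const uint16_t *values, const uint16_t *next,
                   size_t start, size_t steps)
{
    uint32_t sum = 0;
    size_t i = start;

    assert(values != NULL && next != NULL);
    while (steps-- > 0) {
        const size_t following = next[i];
#if defined(__GNUC__)
        /* GCC built-in: a hint only, it never changes the result. */
        __builtin_prefetch(&values[following]);
#endif
        sum += values[i];   /* stand-in for the real computation */
        i = following;
    }
    return sum;
}
```

With arrays of only a few KB, as in the question, the data will often already be in cache, so any gain should be measured rather than assumed.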
With GCC (and to a lesser extent with other common compilers such as the Intel Compiler Collection, Portland Group, and Pathscale C compilers), I've noticed that code that manipulates pointers (instead of array pointers and array indexing) compiles to better machine code on x86 and x86-64. (The reason is actually quite simple: with array pointers and array indexing, you need at least two separate registers, and x86 has relatively few of those. Even x86-64 doesn't have that many of them. GCC in particular is not very strong at optimizing register usage, although it's much better now than in the version 3 era, so this seems to help a lot in some cases.) For example, if you were to access the first array in a struct foo sequentially, then
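void do_something(struct foo *ref, const size_t length1)
{
    /* length1: number of elements in array1; struct foo does not
       store it, so here the caller is assumed to pass it in. */
    uint16_t *array1 = ref->array1;
    uint16_t *const limit1 = ref->array1 + length1;
    for (; array1 < limit1; array1++) {
        /* ... */
    }
}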
Approach B is possible (why don't you just try it?), and it is better, not so much because of memory locality, but because malloc() costs time, so the fewer times you call it, the better off you are. (Assuming that 'better' means 'faster', which, admittedly, is not necessarily the case.)
Memory locality is only marginally improved, since all memory blocks would most likely be contiguous (one after the other) in memory, so if you went with approach A your arrays would only be separated by block headers, which are not very big. (Of the order of 32 bytes each, maybe a bit larger, but not by much.) The only situation in which your blocks would not be contiguous is if you had previously been doing malloc() and free(), so your memory would be fragmented.