
Using struct of structs efficiently when passing into function and accessing values stored within in C

My code is organized around a struct of structs, and main calls a number of functions that take a pointer to the top-level struct as an argument. I am wondering whether certain choices I made in this organization adversely affect the speed of my code. A minimal example for the sake of my question looks like this:

#include <math.h>
#include <stdlib.h>

#define NPMAX 50000

typedef struct Particles{
    double *X, *Y, *Z;
} Particles;

typedef struct Properties{
    int Npart;
    double Box[3];
    double minDist;
} Properties;

typedef struct System{
    Properties props;
    Particles parts;
} System;

void function(System *sys){
    double dist;
    int i;

    for(i=0; i<sys->props.Npart; i++){
        dist = pow(sys->parts.X[i],2.) + pow(sys->parts.Y[i],2.) + pow(sys->parts.Z[i],2.);
        if(dist<sys->props.minDist) sys->props.minDist=dist;
    }
    return;
}

With the following main function:

int main(){
    System sys;
    sys.parts.X = (double *)malloc(sizeof(double) * NPMAX);
    sys.parts.Y = (double *)malloc(sizeof(double) * NPMAX);
    sys.parts.Z = (double *)malloc(sizeof(double) * NPMAX);

    //... some code to populate sys.parts.X, Y, and Z ...

    sys.props.Npart = 1000;
    sys.props.Box[0] = 10.; //etc.
    sys.props.minDist = 9999.;

    function(&sys);

    // some file I/O

    return 0;

}

My question is: given this data structure, have I organized my function in the best possible way for efficiency? I mean speed-wise, not in terms of memory. More specifically:

  • Is accessing and assigning values to sys->parts.X[i] slower than creating a pointer directly to sys->parts within the function and doing parts->X[i], for instance?

  • Is having variables allocated both on the heap and on the stack within the same struct a wise choice speed-wise? Is the program losing time trying to access these values in memory because of this mix?

  • Should I expect this approach to be faster than just using a global variable for each individual variable declared within the structs?

I have access to the Intel compilers in addition to gcc, and I'm compiling with the -O3 flag.

The memory layout looks fine. With only a few allocations the structure doesn't matter that much. Those double arrays do offer a nice option for vector computing with a temporary array in between.

// collect computations first
double dist[NPMAX];
// process 8 64-bit floating-point values at a time (AVX-512, needs <immintrin.h>)
int n = sys->props.Npart & ~7;
for(int i = 0; i < n; i += 8){
    __m512d x = _mm512_loadu_pd(&sys->parts.X[i]);
    __m512d y = _mm512_loadu_pd(&sys->parts.Y[i]);
    __m512d z = _mm512_loadu_pd(&sys->parts.Z[i]);
    // squared distance, matching the pow(..., 2.) calls in the question
    __m512d sum = _mm512_add_pd(_mm512_mul_pd(x, x), _mm512_mul_pd(y, y));
    sum = _mm512_add_pd(sum, _mm512_mul_pd(z, z));
    _mm512_storeu_pd(&dist[i], sum);
}
// deal with remainders (if any)
for (int i = n; i < sys->props.Npart; i++)
    dist[i] = sys->parts.X[i] * sys->parts.X[i]
            + sys->parts.Y[i] * sys->parts.Y[i]
            + sys->parts.Z[i] * sys->parts.Z[i];

// then find lowest
for (int i = 0; i < sys->props.Npart; i++)
    if(dist[i] < sys->props.minDist) sys->props.minDist = dist[i];
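
For comparison, here is a portable sketch of the two-pass idea (fill a temporary array, then take the minimum) without intrinsics, using the squared distance from the question. The function name min_sq_dist_two_pass and its parameters are hypothetical, not part of the original code. Whether the compiler actually vectorizes the first loop at -O3 depends on the compiler and target, so check the generated assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: pass 1 fills scratch[] with squared distances,
   pass 2 reduces scratch[] to its minimum, starting from start_min. */
static double min_sq_dist_two_pass(const double *x, const double *y,
                                   const double *z, double *scratch,
                                   size_t n, double start_min)
{
    assert(x && y && z && scratch);

    /* Pass 1: iterations are independent, so this loop is a natural
       candidate for auto-vectorization. */
    for (size_t i = 0; i < n; i++)
        scratch[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];

    /* Pass 2: reduce the temporary array to its minimum. */
    double min = start_min;
    for (size_t i = 0; i < n; i++)
        if (scratch[i] < min)
            min = scratch[i];

    return min;
}
```

With the question's types it would be called as sys->props.minDist = min_sq_dist_two_pass(sys->parts.X, sys->parts.Y, sys->parts.Z, dist, sys->props.Npart, sys->props.minDist);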

Is accessing and assigning values to sys->parts.X[i] slower than creating a pointer directly to sys->parts within the function and doing parts->X[i], for instance?

From the compiler's point of view, only side effects matter. I think both cases should be optimized to the same instructions by a sane compiler with good optimization. Let's test it out:

void function(System *sys){
    double dist;
    int i;

    for(i=0; i<sys->props.Npart; i++){
        dist = pow(sys->parts.X[i],2.) + pow(sys->parts.Y[i],2.) + pow(sys->parts.Z[i],2.);
        if(dist<sys->props.minDist) sys->props.minDist=dist;
    }
    return;
}

void function2(System *sys){
    double dist;
    int i;

    for(i=0; i<sys->props.Npart; i++){
        const struct Particles * const p = &sys->parts;
        dist = pow(p->X[i],2.) + pow(p->Y[i],2.) + pow(p->Z[i],2.);
        if(dist<sys->props.minDist) sys->props.minDist=dist;
    }
    return;
}

Both functions compile to the same assembly instructions, as shown at godbolt. Throughout this post I am using gcc 8.2 on the 64-bit x86_64 architecture.

Is having variables allocated both on the heap and on the stack within the same struct a wise choice speed-wise? Is the program losing time trying to access these values in memory because of this mix?

The real answer is: it depends on the architecture. On x86_64 I believe there will be no measurable difference in accessing (not allocating) array members between:

System sys_instance;
System *sys = &sys_instance;
double Xses[NPMAX];
sys->parts.X = Xses;
double Yses[NPMAX];
sys->parts.Y = Yses;
double Zses[NPMAX];
sys->parts.Z = Zses;

and:

System *sys = alloca(sizeof(*sys));
sys->parts.X = alloca(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = alloca(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = alloca(sizeof(*sys->parts.Z) * NPMAX);

and:

System *sys = malloc(sizeof(*sys));
sys->parts.X = malloc(sizeof(*sys->parts.X) * NPMAX);
sys->parts.Y = malloc(sizeof(*sys->parts.Y) * NPMAX);
sys->parts.Z = malloc(sizeof(*sys->parts.Z) * NPMAX);

or any mix of these forms. Whether you use malloc or alloca, both yield a pointer, and from the point of view of access a pointer is a pointer. But keep in mind CPU caches and other architecture-dependent optimizations. Using malloc will result in significantly "slower" allocation.

Should I expect this approach to be faster than just using a global variable for each individual variable declared within the structs?

Even if you do:

static System sys_static;
System *sys = &sys_static;
static double X_static[NPMAX];
sys->parts.X = X_static;
static double Y_static[NPMAX];
sys->parts.Y = Y_static;
static double Z_static[NPMAX];
sys->parts.Z = Z_static;

a pointer to sys is still passed to your function function, and all accesses are the same.

In some rare cases (no malloc, sys initialization with no side effects, your function declared static, and a good optimizer), the call could be optimized out and sys->props.minDist precalculated by the compiler at compile time. But I wouldn't aim for that, unless you want to use C++ with consteval or constexpr.


If the number of X, Y and Z values is the same, I would go with what @WhozCraig suggested.

void function(System *sys){
    double dist;
    int i;

    for(i=0; i<sys->props.Npart; i++){
        const struct Particles * const p = &sys->parts[i];
        dist = pow(p->X, 2.) + pow(p->Y, 2.) + pow(p->Z, 2.);
        if(dist<sys->props.minDist) sys->props.minDist=dist;
    }
    return;
}

This will save the cycles needed for multiplication. It will also reduce the number of mallocs needed to allocate (and resize) elements. The address of sys->parts[i] can be calculated once for the whole dist= line. In the case of sys->parts.X[i], sys->parts may be calculated once, but then for each of X, Y and Z the value pointer + sizeof(elem) * i must be calculated. With a decent compiler and optimizer, though, it makes no real difference: in practice this approach resulted in different assembly but the same number of instructions, see godbolt.
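
The function above only compiles if Particles describes a single particle and System.parts points to an array of them, which is not the layout from the question. Here is a minimal sketch of that array-of-structs layout. The type changes are inferred from the sys->parts[i] and p->X accesses, and the helpers system_alloc and system_free are hypothetical names added only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed array-of-structs layout: one record per particle. */
typedef struct Particles {
    double X, Y, Z;
} Particles;

typedef struct Properties {
    int Npart;
    double Box[3];
    double minDist;
} Properties;

typedef struct System {
    Properties props;
    Particles *parts;   /* one allocation holding Npart records */
} System;

/* Hypothetical helper: allocate every particle with a single calloc
   instead of three mallocs. Expects npart > 0; returns 0 on success,
   -1 on allocation failure. */
static int system_alloc(System *sys, int npart)
{
    assert(sys != NULL && npart > 0);
    sys->parts = calloc((size_t)npart, sizeof *sys->parts);
    if (sys->parts == NULL)
        return -1;
    sys->props.Npart = npart;
    sys->props.minDist = 9999.;
    return 0;
}

/* Hypothetical helper: release the particle array. */
static void system_free(System *sys)
{
    assert(sys != NULL);
    free(sys->parts);
    sys->parts = NULL;
    sys->props.Npart = 0;
}
```

Keeping X, Y and Z of one particle adjacent suits loops that touch all three coordinates together. The separate-arrays layout from the question is the friendlier one for the SIMD approach, so which is better depends on the access pattern.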

I would definitely declare all variables that denote the size of an object as size_t: that is, both the loop counter i and sys->props.Npart. They represent an element count, which is what size_t is for.

But I would definitely hand-optimize the loop. You are accessing sys->props.Npart in each loop check. If staying with pointers, I would declare double *X, *Y, *Z; as restrict with respect to each other; I suppose you don't expect them to be equal.

You are also accessing sys->props.minDist in each iteration and conditionally assigning to it. You need to dereference sys here only twice, at the beginning and at the end (unless you have some parallel code that depends on the minDist value in the middle of the calculation, which I hope you don't, because your current code has no means of synchronization). Use a local variable and access sys as few times as you can.

I would replace the pow calls with variable assignments (so that each value is dereferenced only once) and plain multiplication. The compiler may assume that a dereferenced variable can change mid-loop if there are any assignments; let's protect against that. However, a good optimizer will optimize the pow(..., 2.) calls anyway.
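
For the question's original layout (separate X, Y and Z arrays), the points above can be sketched roughly as follows. The name function_soa is hypothetical, Npart is changed to size_t as suggested, and the restrict qualifiers assume C99 or later. This is one possible arrangement, not a measured optimum.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Particles {
    double *X, *Y, *Z;
} Particles;

typedef struct Properties {
    size_t Npart;       /* element count, so size_t rather than int */
    double Box[3];
    double minDist;
} Properties;

typedef struct System {
    Properties props;
    Particles parts;
} System;

/* Hypothetical variant of the question's function: sys is read once
   before the loop and written once after it. */
void function_soa(System *sys)
{
    assert(sys != NULL);

    /* Hoist everything reached through sys into locals. */
    const size_t n = sys->props.Npart;
    const double *restrict X = sys->parts.X;
    const double *restrict Y = sys->parts.Y;
    const double *restrict Z = sys->parts.Z;
    double minDist = sys->props.minDist;

    for (size_t i = 0; i < n; i++) {
        /* Load each coordinate once, then multiply instead of pow(). */
        const double x = X[i];
        const double y = Y[i];
        const double z = Z[i];
        const double dist = x * x + y * y + z * z;
        if (dist < minDist)
            minDist = dist;
    }

    /* Single write-back of the result. */
    sys->props.minDist = minDist;
}
```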

If performance matters that much, I would go with:

void function3(System * restrict sys){
    double minDist = sys->props.minDist;

    for (const struct Particles 
            * const start = &sys->parts[0],
            * const stop = &sys->parts[sys->props.Npart],
            * p = start; p < stop; ++p) {
        const double X = p->X;
        const double Y = p->Y;
        const double Z = p->Z;
        const double dist = X * X + Y * Y + Z * Z;
        if (dist < minDist) {
            minDist = dist;
        }
    }

    sys->props.minDist = minDist;
    return;
}

This results in slightly less assembly code, mostly because sys->props.minDist is not accessed on every iteration and there is no need to keep and increment a temporary counter. Use const to hint to the compiler that you won't modify a variable.
