How to speed up this code (MWE!), e.g. using restrict

Question

Is there any way I can accelerate this function:

void task(int I, int J, int K, int *L, int **ids, double *bar){ 
    double *foo[K];
    for (int k=0;k<K;k++)
        foo[k] = new double[I*L[k]];
        // I am filling these arrays somehow
        // This is not a bottleneck, hence omitted here        
    for (int i=0;i<I;i++)
        for (int j=0;j<J;j++){
            double tmp = 1.;
            for (int k=0;k<K;k++)
                tmp *= foo[k][i*L[k]+ids[j][k]]; //ids[j][k]<L[k]
            bar[i*J+j] = tmp;
        }
}

Typical values are: I = 100,000, J = 10,000, K = 3, L = [50, 20, 60].

I read that the __restrict__ keyword/extension could help, but I am not sure how to apply it here. For example, trying to put it into the definition foo[k] = new double[...] I get error: '__restrict__' qualifiers cannot be applied to 'double'. Furthermore, I don't know whether I should, or how I could, declare ids and ids[j], 0 <= j < J, as restricted.

As a note, in my actual code I execute such tasks in parallel, in as many threads as my CPU has cores.

I am writing mostly C-compatible C++, so solutions in both languages are welcome.

Answer 1

https://en.cppreference.com/w/c/language/restrict claims you can declare an array of restrict pointers to double like so in C99/C11:

typedef double *array_t[10];
restrict array_t foo;        // the type of foo is double *restrict[10]

But only gcc accepts that. I think this is a GCC-ism, not valid ISO C11. (gcc also accepts array_t restrict foo_r; but no other compiler accepts that either.)

ICC warns "restrict" is not allowed, and clang rejects it with:

<source>:16:5: error: restrict requires a pointer or reference ('array_t' (aka 'double *[10]') is invalid)
    restrict array_t foo_r;
    ^

MSVC rejects it with error C2219: syntax error: type qualifier must be after '*'.

We get essentially the same behaviour in C++ from these compilers with __restrict, which they accept as a C++ extension with the same semantics as C99 restrict.

As a workaround, you can use a qualified temporary pointer every time you read from foo, instead of indexing foo[k][stuff] directly. I think this promises that the memory you reference through fk isn't the same memory you access through any other pointer within the block where fk is declared.

double *__restrict fk = foo[k];
tmp *= fk[ stuff ];

I don't know how to promise the compiler that none of the foo[0..K-1] pointers alias each other. I don't think the temporary-pointer workaround accomplishes that.

You don't need __restrict here.

I added __restrict to all the pointer declarations, like int *__restrict *__restrict ids, and it doesn't change the asm at all, according to a diff pane on the Godbolt compiler explorer: https://godbolt.org/z/4YjlDA . That is what we'd expect, because type-based aliasing lets the compiler assume that a double store into bar[] doesn't modify any of the int * elements of int *ids[]. As people said in comments, there's no aliasing here that the compiler can't already sort out. And in practice it appears that it does sort it out, without any extra reloads of pointers.

A store to bar[] also can't alias *foo[k], because we got those pointers from new inside this function. They can't be pointing inside bar[].

(All the major x86 C++ compilers (GCC, clang, ICC, MSVC) support __restrict in C++ with the same behaviour as C99 restrict: a promise to the compiler that stores through this pointer don't modify objects that are pointed to by another pointer. I'd recommend __restrict over __restrict__, at least if you mostly want portability across x86 compilers. I'm not sure about compilers outside of that group.)

It looks like you're saying you tried to put __restrict__ into an assignment, not a declaration. That won't work: __restrict applies to the pointer variable itself, not to a single assignment to it.

The first version of the question had a bug in the inner loop: it had K++ instead of k++, so it was pure undefined behaviour and the compilers got weird. The asm didn't make any sense (e.g. no FP multiply instruction, even when foo[] was a function arg). This is why it's a good idea to use a name like klen instead of K for an array dimension.

After fixing that on the Godbolt link, there's still no difference in the asm with or without __restrict on everything, but it's a lot more sane.

BTW, making double *foo[] a function arg would let us look at the asm for just the main loop. In that case you would actually need __restrict, because a store to bar[] could modify an element of foo[][]. This doesn't happen in your function because the compiler knows that new memory isn't pointed to by any existing pointers, but it wouldn't know that if foo was a function arg.
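A minimal sketch of that variant, assuming the caller allocates and fills foo. The name task_kernel and the const qualifiers are my own additions for illustration, not taken from the Godbolt links. Here __restrict on bar is the promise that the doubles stored through bar are not also read through foo, ids or L during the call:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical kernel: the question's main loop with foo passed in by the caller.
// __restrict on bar promises that nothing written through bar is read
// through any of the other pointer arguments inside this function.
void task_kernel(int I, int J, int K, const int *L, const int *const *ids,
                 const double *const *foo, double *__restrict bar)
{
    assert(I >= 0 && J >= 0 && K >= 0);
    for (int i = 0; i < I; i++)
        for (int j = 0; j < J; j++) {
            double tmp = 1.;
            for (int k = 0; k < K; k++)
                tmp *= foo[k][(std::size_t)i * L[k] + ids[j][k]];  // ids[j][k] < L[k]
            bar[(std::size_t)i * J + j] = tmp;  // size_t avoids int overflow in i*J
        }
}
```

Without the qualifier on bar, the compiler has to allow for the possibility that the store to bar changes a double it is about to load from foo[k]; whether that shows up in the generated asm is something to check per compiler.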

A small amount of the work inside the loop is sign-extending 32-bit int results before using them as array indices with 64-bit pointers. This adds a cycle of latency in there somewhere, but not to the loop-carried FP multiply dependency chain, so it may not matter. You can get rid of one instruction inside the inner loop on x86-64 by using size_t k=0; as the inner-most loop counter. L[] is a 32-bit array, so i*L[k] needs to be sign-extended inside the loop. Zero-extension from 32 to 64-bit happens for free on x86-64, so i * (unsigned)L[k] saves a movsx instruction in the pointer-chasing dep chain. Then the inner loop that gcc8.2 makes is all necessary work, required by your nasty data structures / layout. https://godbolt.org/z/bzVSZ7
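The indexing change can be sketched roughly as follows. This is an illustration under my own assumptions: the function name is hypothetical, foo is passed in as an argument, and i and j are also made size_t (the text above only mentions k). Whether a movsx actually disappears depends on the compiler and version, so compare the asm before relying on it:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical variant with unsigned/size_t index arithmetic.
// The casts are only valid because L[k] and ids[j][k] are never negative.
void task_kernel_unsigned(std::size_t I, std::size_t J, std::size_t K,
                          const int *L, const int *const *ids,
                          const double *const *foo, double *__restrict bar)
{
    for (std::size_t i = 0; i < I; i++)
        for (std::size_t j = 0; j < J; j++) {
            double tmp = 1.;
            for (std::size_t k = 0; k < K; k++) {
                assert(L[k] >= 0 && ids[j][k] >= 0 && ids[j][k] < L[k]);
                // 32-bit unsigned values zero-extend to 64 bits, so no
                // sign-extension is needed before the address calculation.
                const std::size_t row = i * (unsigned)L[k];
                tmp *= foo[k][row + (unsigned)ids[j][k]];
            }
            bar[i * J + j] = tmp;
        }
}
```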

I don't know whether that's going to make a difference or not. I think more likely the memory access pattern causing cache misses will be your bottleneck with real data.

It also can't auto-vectorize because the data isn't contiguous. You can't get contiguous source data from looping over j or i, though. At least i would be a simple stride, without having to redo ids[j][k].

If you generate foo[k][...] and bar[...] transposed, so you index with foo[k][ i + I * ids[j][k] ] (and bar[ i + I * j ]), then you'd have contiguous memory in src and dst along i, so you (or the compiler) could use SIMD multiplies.
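A rough sketch of what a transposed layout could look like, assuming each foo_t[k] stores the I values for one id contiguously (element (i, id) lives at foo_t[k][id * I + i]) and bar_t stores element (i, j) at bar_t[j * I + i]. The names foo_t, bar_t and task_transposed are hypothetical. The loops over i then run over unit-stride memory, which is the pattern compilers can auto-vectorize:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical transposed kernel. L is not needed for indexing here,
// because the stride between consecutive ids is I, not L[k].
void task_transposed(std::size_t I, std::size_t J, std::size_t K,
                     const int *const *ids, const double *const *foo_t,
                     double *__restrict bar_t)
{
    for (std::size_t j = 0; j < J; j++) {
        double *__restrict out = bar_t + j * I;   // contiguous output row for this j
        for (std::size_t i = 0; i < I; i++)
            out[i] = 1.;
        for (std::size_t k = 0; k < K; k++) {
            assert(ids[j][k] >= 0);
            // Contiguous run of I source values selected by ids[j][k].
            const double *col = foo_t[k] + (std::size_t)ids[j][k] * I;
            for (std::size_t i = 0; i < I; i++)
                out[i] *= col[i];                 // unit stride in src and dst
        }
    }
}
```

The factors are multiplied in the same order as in the original loop, so the results should match it exactly. This version makes K+1 passes over each output row; with I = 100,000 that row is larger than a typical L1 cache, so processing i in blocks, or fusing the passes for a fixed K = 3, may be worth trying.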

Answer 2

restrict does not matter in this case.

Your algorithm is rubbish and does not allow long vector operations to be used (so micro-optimizations will not help here at all).

You need to find a way for the elements accessed in the inner loop to occupy a consecutive block of array indexes. As it is now, the compiler has to read every single element from a different position in the array, which bars it from loop unrolling and from using longer vector instructions. It may also be very cache-unfriendly.

Rethink the algorithm first: premature optimizations will not help if the algorithm is extremely inefficient.

Edit

After the OP's comment I just want to show the difference between a "naive" approach and a more efficient one (less naive, but harder to understand).

Let's consider the parity of a 32-bit unsigned value. The naive approach:

#include <stdint.h>

int very_naive_parity(const uint32_t val)
{
    unsigned parity = 0;
    
    for(unsigned bit = 0; bit < 32; bit++)
    {
        if(val & (1U << bit))
        {
            parity = !parity;
        }
    }
    return parity;
}

It is very easy to write and understand, but it is extremely inefficient. At least 288 instructions will be executed to calculate this parity.

More efficient:

int parity(const uint32_t val)
{
    uint32_t tmp = val;
    
    tmp ^= tmp >> 16;
    tmp ^= tmp >> 8;
    tmp ^= tmp >> 4;
    return (0b110100110010110 >> (tmp & 0x0f)) & 1;
}

This will be executed in 9 instructions (both counts exclude function prologues and epilogues). Is it harder to understand? Definitely yes. But as I wrote, efficiency usually means code that is less easy for humans.
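To make the constant in the efficient version less opaque, here is a small sketch (with helper names of my own choosing) that rebuilds it. Bit n of the 16-bit table is the parity of the 4-bit number n, and the value comes out as 0x6996, the same constant the return statement above writes in binary. The three XOR-shift steps fold the parity of all 32 bits into the low 4 bits, which then index the table:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: builds the lookup constant bit by bit.
// Bit n is set iff the 4-bit value n has an odd number of 1 bits.
inline uint32_t nibble_parity_table()
{
    uint32_t table = 0;
    for (unsigned n = 0; n < 16; n++) {
        unsigned p = 0;
        for (unsigned b = 0; b < 4; b++)
            p ^= (n >> b) & 1u;
        table |= (uint32_t)p << n;
    }
    return table;
}

// Same folding scheme as the efficient version, using the computed table.
inline int parity_folded(uint32_t val)
{
    uint32_t tmp = val;
    tmp ^= tmp >> 16;   // fold upper half into lower half
    tmp ^= tmp >> 8;    // fold remaining 16 bits into 8
    tmp ^= tmp >> 4;    // fold remaining 8 bits into 4
    return (nibble_parity_table() >> (tmp & 0x0f)) & 1;
}
```

In real code the table would stay a compile-time constant; it is computed here only to show where the value comes from.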


Answer 1
2, accepted, 2019-01-06 10:03:43

Answer 2
0, 2019-01-06 10:34:17
