
C++ Centralizing SIMD usage

I have a library and a lot of projects depending on that library. I want to optimize certain procedures inside the library using SIMD extensions. However, it is important for me to stay portable, so to the user it should be quite abstract. I'll say up front that I don't want to use some other great library that does the trick; I actually want to understand whether what I want is possible, and to what extent.

My very first idea was to have a "vector" wrapper class, so that the usage of SIMD is transparent to the user, and a "scalar" vector class could be used in case no SIMD extension is available on the target machine. The naive thought came to mind to use the preprocessor to select one vector class out of many, depending on which target the library is compiled for. So: one scalar vector class, one with SSE (basically something like this: http://fastcpp.blogspot.de/2011/12/simple-vector3-class-with-sse-support.html ), and so on, all with the same interface. This gives me good performance, but it would mean that I have to compile the library for every kind of SIMD ISA that I use. I would rather evaluate the processor capabilities dynamically at runtime and select the "best" implementation available.

So my second guess was to have a general "vector" class with abstract methods. A "processor evaluator" function would then return instances of the optimal implementation. Obviously this would lead to ugly code, but the pointer to the vector object could be stored in a smart-pointer-like container that just delegates the calls to the vector object. I would actually prefer this method because of its abstraction, but I'm not sure whether calling the virtual methods would kill the performance I gain from the SIMD extensions.

The last option I figured out would be to optimize whole routines and select the optimal one at runtime. I don't like this idea so much, because it forces me to implement whole functions multiple times. I would prefer to do it once; using my idea of the vector class, I would like to do something like this, for example:

void Memcopy(void *dst, void *src, size_t size)
{
    vector v;
    char *d = (char *)dst, *s = (char *)src; // arithmetic on void* is not valid C++
    for(size_t i = 0; i < size; i += v.size())
    {
        v.load(s);
        v.store(d);
        d += v.size();
        s += v.size();
    }
}

I assume here that "size" is a correct value, so that no overlap handling is needed. This example is just meant to show what I would prefer to have. The size method of the vector object would, for example, return 4 if SSE is used and 1 for the scalar version. Is there a proper way to implement this using only runtime information, without losing too much performance? Abstraction is more important to me than performance, but as this is a performance optimization, I wouldn't include it if it didn't speed up my application.

I also found this on the web: http://compeng.uni-frankfurt.de/?vc . It's open source, but I don't understand how the correct vector class is chosen.

Your idea will only compile to efficient code if everything inlines at compile time, which is incompatible with runtime CPU dispatching. For v.load(), v.store(), and v.size() to actually be different at runtime depending on the CPU, they'd have to be actual function calls, not single instructions. The overhead would be killer.


If your library has functions that are big enough to work without being inlined, then function pointers are great for dispatching based on runtime CPU detection. (e.g. make multiple versions of memcpy, and pay the overhead of runtime detection once per call, not twice per loop iteration.)

This shouldn't be visible in your library's external API/ABI, unless your functions are mostly so short that the overhead of an extra (direct) call/ret matters. In the implementation of your library functions, put each sub-task that you want a CPU-specific version of into a helper function, and call those helper functions through function pointers.


Start with your function pointers initialized to versions that will work on your baseline target: e.g. SSE2 for x86-64; scalar or SSE2 for legacy 32-bit x86 (depending on whether you care about Athlon XP and Pentium III); and probably scalar for non-x86 architectures. In a constructor or library init function, do a CPUID and update the function pointers to the best version for the host CPU. Even if your absolute baseline is scalar, you could make your "good performance" baseline something like SSSE3, and not spend much (or any) time on SSE2-only routines. Even if you're mostly targeting SSSE3, some of your routines will probably end up only requiring SSE2, so you might as well mark them as such and let the dispatcher use them on CPUs that only do SSE2.

Updating the function pointers shouldn't even require any locking. Any calls that happen from other threads before your constructor is done setting the function pointers may get the baseline version, but that's fine. Storing a pointer to an aligned address is atomic on x86. If it's not atomic on a platform where you have a version of a routine that needs runtime CPU detection, use C++ std::atomic (with memory_order_relaxed stores and loads, not the default sequential consistency, which would trigger a full memory barrier on every load). It matters a lot that there's minimal overhead when calling through the function pointers, and it doesn't matter in what order different threads see the changes to the function pointers. They're write-once.


x264 (the heavily-optimized open-source h.264 video encoder) uses this technique extensively, with arrays of function pointers. See x264_mc_init_mmx(), for example. (That function handles all CPU dispatching for motion-compensation functions, from MMX to AVX2.) I assume libx264 does the CPU dispatching in the "encoder init" function. If you don't have a function that users of your library are required to call, then you should look into some kind of mechanism for running global constructor / init functions when programs using your library start up.


If you want this to work with very C++ey code (C++ish? Is that a word?), i.e. templated classes & functions, the program using the library will probably have to do the CPU dispatching itself, and arrange to get the baseline and the multiple CPU-requirement versions of the functions compiled.

I do exactly this with a fractal project. It works with vector sizes of 1, 2, 4, 8, and 16 for float and 1, 2, 4, and 8 for double. I use a CPU dispatcher at runtime to select among the following instruction sets: SSE2, SSE4.1, AVX, AVX+FMA, and AVX512.

The reason I use a vector size of 1 is to test performance. There is already a SIMD library that does all this: Agner Fog's Vector Class Library. He even includes example code for a CPU dispatcher.

The VCL emulates hardware such as AVX on systems that only have SSE (or even AVX512 for SSE). It just implements AVX twice (or four times for AVX512), so in most cases you can just use the largest vector size you want to target.

//#include "vectorclass.h"
void Memcopy(void *dst, void *src, size_t size)
{
    Vec8f v; // eight floats, using AVX hardware or AVX emulated with SSE twice
    float *d = (float *)dst, *s = (float *)src; // arithmetic on void* is not valid C++
    for(size_t i = 0; i < size; i += v.size())
    {
        v.load(s);
        v.store(d);
        d += v.size();
        s += v.size();
    }
}

(However, writing an efficient memcpy is complicated. For large sizes you should consider non-temporal stores, and on IVB and above use rep movsb instead.) Notice that this code is identical to what you asked for, except that I changed the word vector to Vec8f.

Using the VCL, a CPU dispatcher, templating, and macros, you can write your code/kernel so that it looks nearly identical to scalar code, without source-code duplication for every different instruction set and vector size. It's your binaries that will be bigger, not your source code.

I have described CPU dispatchers several times. You can also see an example using templating and macros for a dispatcher here: alias of a function template

Edit: Here is an example of part of my kernel, which calculates the Mandelbrot set for a set of pixels equal to the vector size. At compile time I set TYPE to float, double, or doubledouble and N to 1, 2, 4, 8, or 16. The type doubledouble, which I created and added to the VCL, is described here. This produces the vector types Vec1f, Vec4f, Vec8f, Vec16f, Vec1d, Vec2d, Vec4d, Vec8d, doubledouble1, doubledouble2, doubledouble4, and doubledouble8.

// floatn, intn, and booln are typedefs chosen from TYPE and N via macros
template<typename TYPE, unsigned N>
static inline intn calc(floatn const &cx, floatn const &cy, floatn const &cut, int32_t maxiter) {
    floatn x = cx, y = cy;
    intn n = 0; 
    for(int32_t i=0; i<maxiter; i++) {
        floatn x2 = square(x), y2 = square(y);
        floatn r2 = x2 + y2;
        booln mask = r2<cut;
        if(!horizontal_or(mask)) break;
        add_mask(n,mask);
        floatn t = x*y; mul2(t);
        x = x2 - y2 + cx;
        y = t + cy;
    }
    return n;
}

So my SIMD code for several different data types and vector sizes is nearly identical to the scalar code I would use. I have not included the part of my kernel which loops over each super-pixel.

My build file looks something like this:

g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2          -Ivectorclass  kernel.cpp -okernel_sse2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse4.1        -Ivectorclass  kernel.cpp -okernel_sse41.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx           -Ivectorclass  kernel.cpp -okernel_avx.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma    -Ivectorclass  kernel.cpp -okernel_avx2.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx2 -mfma    -Ivectorclass  kernel_fma.cpp -okernel_fma.o
g++ -m64 -c -Wall -g -std=gnu++11 -O3 -fopenmp -mfpmath=sse -mavx512f -mfma -Ivectorclass  kernel.cpp -okernel_avx512.o
g++ -m64 -Wall -Wextra -std=gnu++11 -O3 -fopenmp -mfpmath=sse -msse2 -Ivectorclass frac.cpp vectorclass/instrset_detect.cpp kernel_sse2.o kernel_sse41.o kernel_avx.o kernel_avx2.o kernel_avx512.o kernel_fma.o -o frac

Then the dispatcher looks something like this:

int iset = instrset_detect();
fp_float1  = NULL; 
fp_floatn  = NULL;
fp_double1 = NULL;
fp_doublen = NULL;
fp_doublefloat1  = NULL;
fp_doublefloatn  = NULL;
fp_doubledouble1 = NULL;
fp_doubledoublen = NULL;
fp_float128 = NULL;
fp_floatn_fma = NULL;
fp_doublen_fma = NULL;

if (iset >= 9) {
    fp_float1  = &manddd_AVX512<float,1>;
    fp_floatn  = &manddd_AVX512<float,16>;
    fp_double1 = &manddd_AVX512<double,1>;
    fp_doublen = &manddd_AVX512<double,8>;
    fp_doublefloat1  = &manddd_AVX512<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX512<doublefloat,16>;
    fp_doubledouble1 = &manddd_AVX512<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX512<doubledouble,8>;
}
else if (iset >= 8) {
    fp_float1  = &manddd_AVX2<float,1>;
    fp_floatn  = &manddd_AVX2<float,8>;
    fp_double1 = &manddd_AVX2<double,1>;
    fp_doublen = &manddd_AVX2<double,4>;
    fp_doublefloat1  = &manddd_AVX2<doublefloat,1>;
    fp_doublefloatn  = &manddd_AVX2<doublefloat,8>;
    fp_doubledouble1 = &manddd_AVX2<doubledouble,1>;
    fp_doubledoublen = &manddd_AVX2<doubledouble,4>;
}
....

This sets function pointers for each of the different possible datatype/vector-size combinations for the instruction set found at runtime. Then I can call whatever function I'm interested in.

Thanks Peter Cordes and Z boson. With both of your replies I came to a solution that satisfies me. I chose Memcopy just as an example, because everyone knows it and because of its beautiful simplicity (but also slowness) when implemented naively, in contrast to SIMD optimizations, which are often not very readable anymore but of course much faster. I now have two classes (more are possible, of course), a scalar vector and an SSE vector, both with inline methods. To the user I expose something like:

typedef void (*MEM_COPY_FUNC)(void *, const void *, size_t);

extern MEM_COPY_FUNC memCopyPointer;

I declare my function like this, as Z boson pointed out:

template<typename VectorType>
void MemCopyTemplate(void *pDest, const void *prc, size_t size)
{
    VectorType v;
    byte *pDst, *pSrc;
    uint32 mask;

    pDst = (byte *)pDest;
    pSrc = (byte *)prc;

    mask = v.GetSize() - 1; // GetSize() is a power of two
    while(size & mask)      // copy the remainder byte-by-byte first
    {
        *pDst++ = *pSrc++;
        size--;
    }

    while(size)
    {
        v.Load(pSrc);
        v.Store(pDst);

        pDst += v.GetSize();
        pSrc += v.GetSize();
        size -= v.GetSize();
    }
}

And at runtime, when the library is loaded, I use CPUID to do either

memCopyPointer = MemCopyTemplate<ScalarVector>;

or

memCopyPointer = MemCopyTemplate<SSEVector>;

as you both suggested. Thanks a lot.
