How to synchronize C & C++ libraries with minimal performance penalty?

I have a C library with numerous math routines for dealing with vectors, matrices, quaternions and so on. It needs to remain in C because I often use it for embedded work and as a Lua extension. In addition, I have C++ class wrappers to allow for more convenient object management and operator overloading for math operations using the C API. The wrapper consists only of a header file, and it makes as much use of inlining as possible.

Is there an appreciable penalty for wrapping the C code versus porting and inlining the implementation directly into the C++ class? This library is used in time-critical applications. So, does the boost from eliminating indirection compensate for the maintenance headache of two ports?

Example of C interface:

typedef float VECTOR3[3];

void v3_add(VECTOR3 *out, VECTOR3 lhs, VECTOR3 rhs);

Example of C++ wrapper:

class Vector3
{
private:
    VECTOR3 v_;

public:
    // copy constructors, etc...

    Vector3& operator+=(const Vector3& rhs)
    {
        // VECTOR3 parameters decay to float*, so the cast targets the pointer type
        v3_add(&this->v_, this->v_, const_cast<float*>(rhs.v_));
        return *this;
    }

    Vector3 operator+(const Vector3& rhs) const
    {
        Vector3 tmp(*this);
        tmp += rhs;
        return tmp;
    }

    // more methods...
};

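If you are just wrapping the C library calls in C++ class functions (in other words, the C++ functions do nothing but call the C functions), the compiler will optimize these calls so that there is no performance penalty.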

As with any question about performance, you'll be told to measure to get your answer (and that's the strictly correct answer).

But as a rule of thumb, for simple inline methods that can actually be inlined, you'll see no performance penalty. In general, an inline method that does nothing but pass the call on to another function is a great candidate for inlining.

However, even if your wrapper methods were not inlined, I suspect you'd notice no performance penalty - not even a measurable one - unless the wrapper method was being called in some critical loop. Even then it would likely only be measurable if the wrapped function itself didn't do much work.

This type of thing is about the last thing to be concerned about. First worry about making your code correct and maintainable, and about using appropriate algorithms.

As usual with everything related to optimization, the answer is that you have to measure the performance itself before you know if the optimization is worthwhile.

  • Benchmark two different functions, one calling the C-style functions directly and another calling through the wrapper. See which one runs faster, or whether the difference is within the margin of error of your measurement (which would mean there is no difference you can measure).
  • Look at the assembly code generated for the two functions in the previous step (on gcc, use -S or -save-temps). See if the compiler did something stupid, or if your wrappers have any performance bug.
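The first step above could be sketched roughly as follows. This is only an illustration: `v3_add` here is a stand-in body assumed from the signature in the question, and `time_ns`, `bench_direct` and `bench_wrapped` are hypothetical helper names. For a meaningful measurement the real C library should be linked in from its own object file, since a compiler that sees the stand-in in the same file may inline it.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

typedef float VECTOR3[3];

// Stand-in for the C routine (assumption: element-wise add).
// In a real benchmark, link against the actual C library instead.
extern "C" void v3_add(VECTOR3 *out, VECTOR3 lhs, VECTOR3 rhs)
{
    assert(out != nullptr);
    for (int i = 0; i < 3; ++i)
        (*out)[i] = lhs[i] + rhs[i];
}

// Cut-down version of the wrapper from the question.
class Vector3
{
    VECTOR3 v_;

public:
    Vector3(float x, float y, float z) { v_[0] = x; v_[1] = y; v_[2] = z; }

    Vector3& operator+=(const Vector3& rhs)
    {
        v3_add(&v_, v_, const_cast<float*>(rhs.v_));
        return *this;
    }

    float operator[](std::size_t i) const { return v_[i]; }
};

// Runs body() once and returns the elapsed wall-clock time in nanoseconds.
template <typename F>
long long time_ns(F&& body)
{
    const auto t0 = std::chrono::steady_clock::now();
    body();
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}

// n additions through the C API directly.
float bench_direct(std::size_t n)
{
    VECTOR3 acc = {0.0f, 0.0f, 0.0f};
    VECTOR3 step = {1.0f, 2.0f, 3.0f};
    for (std::size_t i = 0; i < n; ++i)
        v3_add(&acc, acc, step);
    return acc[0] + acc[1] + acc[2];  // returned so the loop is not optimized away
}

// The same n additions through the C++ wrapper.
float bench_wrapped(std::size_t n)
{
    Vector3 acc(0.0f, 0.0f, 0.0f);
    const Vector3 step(1.0f, 2.0f, 3.0f);
    for (std::size_t i = 0; i < n; ++i)
        acc += step;
    return acc[0] + acc[1] + acc[2];
}
```

Timing each function with something like time_ns([&] { sink = bench_direct(n); }), repeated several times with optimization enabled, gives the two numbers to compare; the same two functions are convenient targets for the assembly inspection in the second step.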

Unless the performance difference strongly favors not using the wrapper, reimplementing is not a good idea, since you risk introducing bugs (which could even cause results that look sane but are wrong). Even if the difference is big, it would be simpler and less risky to just remember that C++ is very compatible with C and use your library in the C style even within C++ code.

Your wrapper itself will be inlined; however, your method calls into the C library typically will not. (That would require link-time optimization, which is technically possible but, AFAIK, rudimentary at best in today's tools.)

Generally, a function call as such is not very expensive. The cycle cost has decreased considerably over the last years, and the call can be predicted easily, so the call penalty as such is negligible.

However, inlining opens the door to more optimizations: if you have v = a + b + c, your wrapper class forces the generation of stack variables, whereas for inlined calls, the majority of the data can be kept in the FPU stack. Also, inlined code allows simplifying instructions, considering constant values, and more.
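To make the point about temporaries concrete, here is roughly what v = a + b + c turns into when every addition goes through the C API, next to a hand-inlined version. It is a sketch only: the `v3_add` body is assumed from the question's signature, and `sum3_wrapped` and `sum3_inlined` are hypothetical names.

```cpp
#include <cassert>

typedef float VECTOR3[3];

// Stand-in for the out-of-line C routine (assumption: element-wise add).
extern "C" void v3_add(VECTOR3 *out, VECTOR3 lhs, VECTOR3 rhs)
{
    assert(out != nullptr);
    for (int i = 0; i < 3; ++i)
        (*out)[i] = lhs[i] + rhs[i];
}

// Through the C API: each '+' needs an addressable temporary, because
// v3_add writes its result through a pointer.
void sum3_wrapped(VECTOR3 *out, VECTOR3 a, VECTOR3 b, VECTOR3 c)
{
    VECTOR3 tmp;          // must live in memory when v3_add is not inlined
    v3_add(&tmp, a, b);   // tmp = a + b
    v3_add(out, tmp, c);  // out = tmp + c
}

// Hand-inlined: the intermediate sum never needs an address, so the
// compiler is free to keep it in registers.
inline void sum3_inlined(VECTOR3 *out, const float *a, const float *b, const float *c)
{
    assert(out != nullptr);
    for (int i = 0; i < 3; ++i)
        (*out)[i] = a[i] + b[i] + c[i];
}
```

Both functions compute the same result; the difference, if any, shows up only in the generated code and in measurements.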

So while the "measure before you invest" rule holds true, I would expect some room for improvement here.


A typical solution is to bring the C implementation into a format in which it can be used either as inline functions or as a "C" body:

// V3impl.inl
void V3DECL v3_add(VECTOR3 *out, VECTOR3 lhs, VECTOR3 rhs)
{
    // here you maintain the actual implementations
    // ...
}

// C header
#define V3DECL 
void V3DECL v3_add(VECTOR3 *out, VECTOR3 lhs, VECTOR3 rhs);

// C body
#include "V3impl.inl"


// CPP Header
#define V3DECL inline
namespace v3core {
  #include "V3impl.inl"
} // namespace

class Vector3D { ... }

This likely makes sense only for selected methods with comparatively simple bodies. I'd move the methods to a separate namespace for the C++ implementation, as you will usually not need them directly.

(Note that inline is just a compiler hint; it doesn't force the method to be inlined. But that's good: if the code size of an inner loop exceeds the instruction cache, inlining can easily hurt performance.)

Whether the pass/return-by-reference can be resolved depends on the strength of your compiler. I've seen many where foo(X * out) forces stack variables, whereas X foo() does keep values in registers.
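The two call shapes being compared look roughly like this. It is a sketch with a hypothetical `Vec3` struct and hypothetical function names, not the library's API, and it assumes C++11 for the brace initialization. Whether the second form really stays in registers depends on the compiler and the ABI.

```cpp
#include <cassert>

// Hypothetical value type used only for this illustration.
struct Vec3
{
    float x, y, z;
};

// Out-parameter style, like foo(X * out): the caller must provide an
// addressable object, which a weaker compiler will place on the stack.
inline void add_out(Vec3 *out, const Vec3 &a, const Vec3 &b)
{
    assert(out != nullptr);
    out->x = a.x + b.x;
    out->y = a.y + b.y;
    out->z = a.z + b.z;
}

// Return-by-value style, like X foo(): no address is taken, so the
// result may be passed back in registers where the ABI allows it.
inline Vec3 add_val(const Vec3 &a, const Vec3 &b)
{
    return Vec3{a.x + b.x, a.y + b.y, a.z + b.z};
}
```

With the second form, a chained expression such as add_val(add_val(a, b), c) needs no named temporary at the call site.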

I don't think you'll notice much of a perf difference, assuming your target platform supports all your data types.

I'm coding for the DS and a few other ARM devices, and floating point is evil there... I had to typedef float to FixedPoint<16,8>.

If you are worried that the overhead of calling functions is slowing you down, why not test inlining the C code or turning it into macros?
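As a rough idea of what those two variants could look like for the vector add, here is a sketch. `V3_ADD` and `v3_add_inline` are hypothetical names, and both take plain arrays rather than the VECTOR3 * out-parameter of the original API.

```cpp
#include <cassert>

typedef float VECTOR3[3];

// Macro form: always expanded at the call site, but the arguments are
// evaluated several times and there is no type checking.
#define V3_ADD(out, lhs, rhs)               \
    do {                                    \
        (out)[0] = (lhs)[0] + (rhs)[0];     \
        (out)[1] = (lhs)[1] + (rhs)[1];     \
        (out)[2] = (lhs)[2] + (rhs)[2];     \
    } while (0)

// Inline-function form: type checked, arguments evaluated once, and the
// compiler decides whether expanding it is worthwhile.
static inline void v3_add_inline(float *out, const float *lhs, const float *rhs)
{
    assert(out && lhs && rhs);
    for (int i = 0; i < 3; ++i)
        out[i] = lhs[i] + rhs[i];
}
```

The inline function is usually the safer of the two to try first, since an argument with side effects, such as a pointer increment, behaves unexpectedly in the macro.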

Also, why not improve the const correctness of the C code while you are at it? const_cast should really be used sparingly, especially on interfaces you control.
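A const-correct variant might look like the sketch below. The changed `v3_add` signature and its body are assumptions for illustration, not the library's actual code; the wrapper is a cut-down version of the one in the question.

```cpp
#include <cassert>

typedef float VECTOR3[3];

// Hypothetical const-correct signature: the inputs are read-only.
// As parameters, 'const VECTOR3' adjusts to 'const float *'.
extern "C" void v3_add(VECTOR3 *out, const VECTOR3 lhs, const VECTOR3 rhs)
{
    assert(out != nullptr);
    for (int i = 0; i < 3; ++i)
        (*out)[i] = lhs[i] + rhs[i];
}

class Vector3
{
    VECTOR3 v_;

public:
    Vector3(float x, float y, float z) { v_[0] = x; v_[1] = y; v_[2] = z; }

    Vector3& operator+=(const Vector3& rhs)
    {
        v3_add(&v_, v_, rhs.v_);  // no const_cast needed any more
        return *this;
    }

    float operator[](int i) const { return v_[i]; }
};
```

With the inputs declared const, the wrapper can pass rhs.v_ straight through, and the compiler will reject any C++ caller that relies on the inputs being modified.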
