
Math optimization in C#

I've been profiling an application all day long and, having optimized a couple of bits of code, I'm left with this on my todo list. It's the activation function for a neural network, which gets called over 100 million times. According to dotTrace, it amounts to about 60% of the overall function time.

How would you optimize this?

public static float Sigmoid(double value) {
    return (float) (1.0 / (1.0 + Math.Pow(Math.E, -value)));
}

Try:

public static float Sigmoid(double value) {
    return 1.0f / (1.0f + (float) Math.Exp(-value));
}

EDIT: I did a quick benchmark. On my machine, the above code is about 43% faster than your method, and this mathematically equivalent code is the teeniest bit faster (46% faster than the original):

public static float Sigmoid(double value) {
    float k = (float) Math.Exp(value); // Math.Exp returns double, so C# needs the explicit cast
    return k / (1.0f + k);
}

EDIT 2: I'm not sure how much overhead C# functions have. In C you could #include <math.h> and call expf, which works on floats; C# has no #include, but on runtimes that provide System.MathF the single-precision counterpart is MathF.Exp. It might be a little faster.

public static float Sigmoid(double value) {
    float k = MathF.Exp((float) value);
    return k / (1.0f + k);
}

Also, if you're doing millions of calls, the function-calling overhead might be a problem. Try making an inline function and see if that's any help.

If it's for an activation function, does it matter terribly much if the calculation of e^x is completely accurate?

For example, if you use the approximation (1+x/256)^256, on my Pentium testing in Java (I'm assuming C# essentially compiles to the same processor instructions) this is about 7-8 times faster than e^x (Math.exp()), and is accurate to 2 decimal places up to about x of +/-1.5, and within the correct order of magnitude across the range you stated. (Obviously, to raise to the 256, you actually square the number 8 times -- don't use Math.Pow for this!) In Java:

double eapprox = (1d + x / 256d);
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;
eapprox *= eapprox;

Keep doubling or halving 256 (and adding/removing a multiplication) depending on how accurate you want the approximation to be. Even with n=4, it still gives about 1.5 decimal places of accuracy for values of x between -0.5 and 0.5 (and appears a good 15 times faster than Math.exp()).

PS I forgot to mention -- you should obviously not really divide by 256: multiply by a constant 1/256. Java's JIT compiler makes this optimisation automatically (at least, Hotspot does), and I was assuming that C# must do too.
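Putting the pieces above together, here is a minimal Java sketch of a sigmoid built on the (1+x/256)^256 approximation. The class and method names are made up for illustration, and the loop is just a compact way of writing the eight squarings. It assumes inputs stay in a modest range such as [-5, 5]; far outside that the approximation degrades badly.

```java
/** Sketch of the (1 + x/256)^256 approximation; names are illustrative only. */
final class ApproxExpSigmoid {
    // Multiply by a precomputed reciprocal instead of dividing by 256.
    private static final double INV_256 = 1.0 / 256.0;

    /** Approximates e^x by squaring (1 + x/256) eight times, since 2^8 = 256. */
    static double expApprox(double x) {
        double e = 1.0 + x * INV_256;
        for (int i = 0; i < 8; i++) {
            e *= e;
        }
        return e;
    }

    /** Logistic sigmoid using the approximate exponential. */
    static float sigmoid(double value) {
        return (float) (1.0 / (1.0 + expApprox(-value)));
    }
}
```

Calling ApproxExpSigmoid.sigmoid(x) in place of the exact version is enough to compare accuracy and speed on your own data.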

Have a look at this post. It has an approximation for e^x written in Java; this should be the C# code for it (untested):

public static double Exp(double val) {  
    long tmp = (long) (1512775 * val + 1072632447);  
    return BitConverter.Int64BitsToDouble(tmp << 32);  
}

In my benchmarks this is more than 5 times faster than Math.exp() (in Java). The approximation is based on the paper "A Fast, Compact Approximation of the Exponential Function", which was developed exactly to be used in neural nets. It is basically the same as a lookup table of 2048 entries and linear approximation between the entries, but all this with IEEE floating point tricks.

EDIT: According to Special Sauce this is ~3.25x faster than the CLR implementation. Thanks!
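For reference, a Java rendering of the same trick might look like the sketch below, with Double.longBitsToDouble playing the role of BitConverter.Int64BitsToDouble. The class name and the sigmoid wrapper are my own additions. The two constants are the ones from the C# snippet: 1512775 is roughly 2^20 / ln 2, and 1072632447 is 1023 * 2^20 minus a small correction term.

```java
/** Sketch of the bit-manipulation exponential; names are illustrative only. */
final class FastExp {
    /**
     * Approximates e^val by writing a scaled, shifted copy of val into the
     * upper 32 bits of an IEEE 754 double, which hold the sign, the exponent
     * and the top of the mantissa. Only meaningful while the result is a
     * normal double (roughly |val| < 700).
     */
    static double exp(double val) {
        long tmp = (long) (1512775 * val + 1072632447);
        return Double.longBitsToDouble(tmp << 32);
    }

    /** Logistic sigmoid using the approximate exponential. */
    static double sigmoid(double value) {
        return 1.0 / (1.0 + exp(-value));
    }
}
```

The relative error of the exponential stays within a few percent, so this is only suitable where the network tolerates a slightly distorted activation.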

  1. Remember that any changes in this activation function come at the cost of different behavior. This even includes switching to float (and thus lowering the precision) or using activation substitutes. Only experimenting with your use case will show the right way.
  2. In addition to the simple code optimizations, I would also recommend considering parallelization of the computations (i.e. leveraging multiple cores of your machine, or even machines in the Windows Azure cloud) and improving the training algorithms.

UPDATE: Post on lookup tables for ANN activation functions

UPDATE2: I removed the point on LUTs since I've confused these with the complete hashing. Thanks go to Henrik Gustafsson for putting me back on the track. So the memory is not an issue, although the search space still gets messed up with local extrema a bit.
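On the parallelization point, the simplest form is to evaluate a whole layer's activations at once and let the runtime split the work across cores. Below is a rough Java sketch using a parallel stream; the class name is hypothetical, and whether it pays off depends on array size, since small arrays will be dominated by thread hand-off.

```java
import java.util.stream.IntStream;

/** Sketch of a data-parallel sigmoid over an array; names are illustrative only. */
final class ParallelSigmoid {
    /** Applies the logistic sigmoid to every element, splitting the index range across cores. */
    static float[] apply(float[] inputs) {
        float[] outputs = new float[inputs.length];
        // Each index is written by exactly one task, so no locking is needed.
        IntStream.range(0, inputs.length)
                 .parallel()
                 .forEach(i -> outputs[i] = (float) (1.0 / (1.0 + Math.exp(-inputs[i]))));
        return outputs;
    }
}
```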

At 100 million calls, I'd start to wonder if profiler overhead isn't skewing your results. Replace the calculation with a no-op and see if it is still reported to consume 60% of the execution time...

Or better yet, create some test data and use a stopwatch timer to profile a million or so calls.
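A stopwatch-style check along those lines could be sketched as follows (in Java, like several other snippets in this thread; the class and method names are invented). It feeds results into a field so the JIT cannot discard the calls, and passing a no-op function gives the loop-overhead baseline.

```java
import java.util.function.DoubleUnaryOperator;

/** Sketch of a stopwatch-style micro-benchmark; names are illustrative only. */
final class SigmoidTimer {
    // Results are stored here so the JIT cannot treat the calls as dead code.
    private static volatile double sink;

    /** Runs f the given number of times and returns the elapsed wall time in milliseconds. */
    static double timeMillis(DoubleUnaryOperator f, int calls) {
        double sum = 0.0;
        long start = System.nanoTime();
        for (int i = 0; i < calls; i++) {
            // Spread the inputs over [-5, 5) using an int counter.
            double x = (i % 1000) / 100.0 - 5.0;
            sum += f.applyAsDouble(x);
        }
        long elapsed = System.nanoTime() - start;
        sink = sum;
        return elapsed / 1e6;
    }
}
```

For example, compare SigmoidTimer.timeMillis(x -> 0.0, 1_000_000) against the same call with the real sigmoid, and run each a few times so the JIT has warmed up before trusting the numbers.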

If you're able to interop with C++, you could consider storing all the values in an array and looping over them using SSE like this:

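#include <stddef.h>    // size_t
#include <xmmintrin.h> // SSE intrinsics: __m128, _mm_*_ps

// Assumes both arrays are 16-byte aligned and a_Size is a multiple of 4.
void sigmoid_sse(float *a_Values, float *a_Output, size_t a_Size){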
    __m128* l_Output = (__m128*)a_Output;
    __m128* l_Start  = (__m128*)a_Values;
    __m128* l_End    = (__m128*)(a_Values + a_Size);

    const __m128 l_One        = _mm_set_ps1(1.f);
    const __m128 l_Half       = _mm_set_ps1(1.f / 2.f);
    const __m128 l_OneOver6   = _mm_set_ps1(1.f / 6.f);
    const __m128 l_OneOver24  = _mm_set_ps1(1.f / 24.f);
    const __m128 l_OneOver120 = _mm_set_ps1(1.f / 120.f);
    const __m128 l_OneOver720 = _mm_set_ps1(1.f / 720.f);
    const __m128 l_MinOne     = _mm_set_ps1(-1.f);

    for(__m128 *i = l_Start; i < l_End; i++){
        // 1.0 / (1.0 + Math.Pow(Math.E, -value))
        // 1.0 / (1.0 + Math.Exp(-value))

        // value = *i so we need -value
        __m128 value = _mm_mul_ps(l_MinOne, *i);

        // exp expressed as infinite series 1 + x + (x ^ 2 / 2!) + (x ^ 3 / 3!) ...
        __m128 x = value;

        // result in l_Exp
        __m128 l_Exp = l_One; // = 1

        l_Exp = _mm_add_ps(l_Exp, x); // += x

        x = _mm_mul_ps(x, x); // = x ^ 2
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_Half, x)); // += (x ^ 2 * (1 / 2))

        x = _mm_mul_ps(value, x); // = x ^ 3
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver6, x)); // += (x ^ 3 * (1 / 6))

        x = _mm_mul_ps(value, x); // = x ^ 4
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver24, x)); // += (x ^ 4 * (1 / 24))

#ifdef MORE_ACCURATE

        x = _mm_mul_ps(value, x); // = x ^ 5
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver120, x)); // += (x ^ 5 * (1 / 120))

        x = _mm_mul_ps(value, x); // = x ^ 6
        l_Exp = _mm_add_ps(l_Exp, _mm_mul_ps(l_OneOver720, x)); // += (x ^ 6 * (1 / 720))

#endif

        // we've calculated exp of -i
        // now we only need to do the '1.0 / (1.0 + ...' part
        *l_Output++ = _mm_rcp_ps(_mm_add_ps(l_One,  l_Exp));
    }
}

However, remember that the arrays you'll be using should be allocated using _aligned_malloc(some_size * sizeof(float), 16) because SSE requires memory aligned to a 16-byte boundary.

Using SSE, I can calculate the result for all 100 million elements in around half a second. However, allocating that much memory at a time will cost you nearly two-thirds of a gigabyte, so I'd suggest processing more but smaller arrays at a time. You might even want to consider using a double buffering approach with 100K elements or more.

Also, if the number of elements starts to grow considerably you might want to choose to process these things on the GPU (just create a 1D float4 texture and run a very trivial fragment shader).

FWIW, here are my C# benchmarks for the answers already posted. (Empty is a function that just returns 0, to measure the function call overhead.)

Empty Function:       79ms   0
Original:             1576ms 0.7202294
Simplified: (soprano) 681ms  0.7202294
Approximate: (Neil)   441ms  0.7198783
Bit Manip: (martinus) 836ms  0.72318
Taylor: (Rex Logan)   261ms  0.7202305
Lookup: (Henrik)      182ms  0.7204863

public static object[] Time(Func<double, float> f) {
    var testvalue = 0.9456;
    var sw = new Stopwatch();
    sw.Start();
    for (int i = 0; i < 1e7; i++)
        f(testvalue);
    return new object[] { sw.ElapsedMilliseconds, f(testvalue) };
}
public static void Main(string[] args) {
    Console.WriteLine("Empty:       {0,10}ms {1}", Time(Empty));
    Console.WriteLine("Original:    {0,10}ms {1}", Time(Original));
    Console.WriteLine("Simplified:  {0,10}ms {1}", Time(Simplified));
    Console.WriteLine("Approximate: {0,10}ms {1}", Time(ExpApproximation));
    Console.WriteLine("Bit Manip:   {0,10}ms {1}", Time(BitBashing));
    Console.WriteLine("Taylor:      {0,10}ms {1}", Time(TaylorExpansion));
    Console.WriteLine("Lookup:      {0,10}ms {1}", Time(LUT));
}

Note: This is a follow-up to this post.

Edit: Update to calculate the same thing as this and this, taking some inspiration from this.

Now look what you made me do! You made me install Mono!

$ gmcs -optimize test.cs && mono test.exe
Max deviation is 0.001663983
10^7 iterations using Sigmoid1() took 1646.613 ms
10^7 iterations using Sigmoid2() took 237.352 ms

C is hardly worth the effort anymore, the world is moving forward :)

So, just over a factor of 6 faster. Someone with a Windows box gets to investigate the memory usage and performance using MS-stuff :)

Using LUTs for activation functions is not so uncommon, especially when implemented in hardware. There are many well-proven variants of the concept out there if you are willing to include those types of tables. However, as has already been pointed out, aliasing might turn out to be a problem, but there are ways around that too. Some further reading:

Some gotchas with this:

  • The error goes up when you reach outside the table (but converges to 0 at the extremes); for x approx +-7.0. This is due to the chosen scaling factor. With RESOLUTION fixed, larger values of SCALE make the table finer but shorter, so they give lower errors in the middle range, but higher errors at the edges.
  • This is generally a very stupid test, and I don't know C#. It's just a plain conversion of my C code :)
  • Rinat Abdullin is very much correct that aliasing and precision loss might cause problems, but since I have not seen the variables for that I can only advise you to try this. In fact, I agree with everything he says except for the issue of lookup tables.

Pardon the copy-paste coding...

using System;
using System.Diagnostics;

class LUTTest {
    private const float SCALE = 320.0f;
    private const int RESOLUTION = 2047;
    private const float MIN = -RESOLUTION / SCALE;
    private const float MAX = RESOLUTION / SCALE;

    private static readonly float[] lut = InitLUT();

    private static float[] InitLUT() {
      var lut = new float[RESOLUTION + 1];

      for (int i = 0; i < RESOLUTION + 1; i++) {
        lut[i] = (float)(1.0 / (1.0 + Math.Exp(-i / SCALE)));
      }
      return lut;
    }

    public static float Sigmoid1(double value) {
        return (float) (1.0 / (1.0 + Math.Exp(-value)));
    }

    public static float Sigmoid2(float value) {
      if (value <= MIN) return 0.0f;
      if (value >= MAX) return 1.0f;
      if (value >= 0) return lut[(int)(value * SCALE + 0.5f)];
      return 1.0f - lut[(int)(-value * SCALE + 0.5f)];
    }

    public static float error(float v0, float v1) {
      return Math.Abs(v1 - v0);
    }

    public static float TestError() {
        float emax = 0.0f;
        for (float x = -10.0f; x < 10.0f; x+= 0.00001f) {
          float v0 = Sigmoid1(x);
          float v1 = Sigmoid2(x);
          float e = error(v0, v1);
          if (e > emax) emax = e;
        }
        return emax;
    }

    public static double TestPerformancePlain() {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                Sigmoid1(x);
            }
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }    

    public static double TestPerformanceLUT() {
        Stopwatch sw = new Stopwatch();
        sw.Start();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                Sigmoid2(x);
            }
        }
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }    

    static void Main() {
        Console.WriteLine("Max deviation is {0}", TestError());
        Console.WriteLine("10^7 iterations using Sigmoid1() took {0} ms", TestPerformancePlain());
        Console.WriteLine("10^7 iterations using Sigmoid2() took {0} ms", TestPerformanceLUT());
    }
}

F# has better performance than C# in .NET math algorithms. So rewriting the neural network in F# might improve the overall performance.

If we re-implement the LUT benchmarking snippet (I've been using a slightly tweaked version) in F#, then the resulting code:

  • executes the sigmoid1 benchmark in 588.8 ms instead of 3899.2 ms
  • executes the sigmoid2 (LUT) benchmark in 156.6 ms instead of 411.4 ms

More details can be found in the blog post. Here's the F# snippet JIC:

#light

let Scale = 320.0f;
let Resolution = 2047;

let Min = -single(Resolution)/Scale;
let Max = single(Resolution)/Scale;

let range step a b =
  let count = int((b-a)/step);
  seq { for i in 0 .. count -> single(i)*step + a };

let lut = [| 
  for x in 0 .. Resolution ->
    single(1.0/(1.0 +  exp(-double(x)/double(Scale))))
  |]

let sigmoid1 value = 1.0f/(1.0f + exp(-value));

let sigmoid2 v = 
  if (v <= Min) then 0.0f;
  elif (v>= Max) then 1.0f;
  else
    let f = v * Scale;
    if (v>0.0f) then lut.[int (f + 0.5f)]
    else 1.0f - lut.[int(0.5f - f)];

let getError f = 
  let test = range 0.00001f -10.0f 10.0f;
  let errors = seq { 
    for v in test -> 
      abs(sigmoid1(single(v)) - f(single(v)))
  }
  Seq.max errors;

open System.Diagnostics;

let test f = 
  let sw = Stopwatch.StartNew(); 
  let mutable m = 0.0f;
  let result = 
    for t in 1 .. 10 do
      for x in 1 .. 1000000 do
        m <- f(single(x)/100000.0f-5.0f);
  sw.Elapsed.TotalMilliseconds;

printf "Max deviation is %f\n" (getError sigmoid2)
printf "10^7 iterations using sigmoid1: %f ms\n" (test sigmoid1)
printf "10^7 iterations using sigmoid2: %f ms\n" (test sigmoid2)

let c = System.Console.ReadKey(true);

And the output (Release compilation against F# 1.9.6.2 CTP with no debugger):

Max deviation is 0.001664
10^7 iterations using sigmoid1: 588.843700 ms
10^7 iterations using sigmoid2: 156.626700 ms

UPDATE: updated benchmarking to use 10^7 iterations to make results comparable with C

UPDATE2: here are the performance results of the C implementation from the same machine to compare with:

Max deviation is 0.001664
10^7 iterations using sigmoid1: 628 ms
10^7 iterations using sigmoid2: 157 ms

First thought: How about some stats on the values variable?

  • Are the values of "value" typically small, -10 <= value <= 10?

If not, you can probably get a boost by testing for out-of-bounds values:

if(value < -10)  return 0;
if(value > 10)  return 1;
  • Are the values repeated often?

If so, you can probably get some benefit from memoization (probably not, but it doesn't hurt to check...)

if(sigmoidCache.containsKey(value)) return sigmoidCache.get(value);

If neither of these can be applied, then as some others have suggested, maybe you can get away with lowering the accuracy of your sigmoid...

Soprano had some nice optimizations for your call:

public static float Sigmoid(double value) 
{
    float k = (float) Math.Exp(value); // explicit cast: Math.Exp returns double
    return k / (1.0f + k);
}

If you try a lookup table and find it uses too much memory, you could always look at the value of your parameter for each successive call and employ some caching technique.

For example, try caching the last value and result. If the next call has the same value as the previous one, you don't need to calculate it, as you'd have cached the last result. If the current call was the same as the previous call even 1 out of 100 times, you could potentially save yourself 1 million calculations.

Or, you may find that within 10 successive calls, the value parameter is on average the same 2 times, so you could then try caching the last 10 values/answers.
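The single-entry version of that cache is only a few lines. Here is a hedged Java sketch (class name and hit counter are hypothetical, added so you can measure whether repeats actually occur); it is not thread-safe, so each worker thread would need its own instance.

```java
/** Sketch of a last-value cache in front of the sigmoid; names are illustrative only. */
final class CachedSigmoid {
    private double lastInput = Double.NaN; // NaN never compares equal, so the first call always misses
    private float lastResult;
    private long hits;

    float sigmoid(double value) {
        if (value == lastInput) {
            hits++;                         // same argument as the previous call: reuse the result
            return lastResult;
        }
        lastInput = value;
        lastResult = (float) (1.0 / (1.0 + Math.exp(-value)));
        return lastResult;
    }

    /** Number of calls answered from the cache so far. */
    long hits() {
        return hits;
    }
}
```

If hits() stays near zero on real data, the extra comparison is pure overhead and the cache should go.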

Off the top of my head, this paper explains a way of approximating the exponential by abusing floating point (click the link in the top right for the PDF), but I don't know if it'll be of much use to you in .NET.

Also, another point: for the purpose of training large networks quickly, the logistic sigmoid you're using is pretty terrible. See section 4.4 of Efficient Backprop by LeCun et al and use something zero-centered (actually, read that whole paper, it's immensely useful).

There are much faster functions that do very similar things:

x / (1 + abs(x)) - fast replacement for TANH

And similarly:

x / (2 + 2 * abs(x)) + 0.5 - fast replacement for SIGMOID

Compare plots with actual sigmoid
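Written out as code, the two replacements are one-liners. This is a Java sketch with made-up names; it ports to C# almost unchanged.

```java
/** Sketch of the two rational replacements; names are illustrative only. */
final class FastActivations {
    /** x / (1 + |x|): tanh-shaped, output in (-1, 1). */
    static float fastTanh(float x) {
        return x / (1.0f + Math.abs(x));
    }

    /** x / (2 + 2|x|) + 0.5: sigmoid-shaped, output in (0, 1). */
    static float fastSigmoid(float x) {
        return x / (2.0f + 2.0f * Math.abs(x)) + 0.5f;
    }
}
```

Keep in mind these have the right shape but saturate much more slowly: fastSigmoid(5) is about 0.917, where the logistic sigmoid is about 0.993, so a network trained with one will not behave identically with the other.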

Idea: Perhaps you could make a (large) lookup table with pre-calculated values?

This is slightly off topic, but just out of curiosity, I did the same implementation as the ones in C, C# and F# in Java. I'll just leave this here in case someone else is curious.

Result:

$ javac LUTTest.java && java LUTTest
Max deviation is 0.001664
10^7 iterations using sigmoid1() took 1398 ms
10^7 iterations using sigmoid2() took 177 ms

I suppose the improvement over C# in my case is due to Java being better optimized than Mono for OS X. On a similar MS .NET implementation (vs. Java 6, if someone wants to post comparative numbers) I suppose the results would be different.

Code:

public class LUTTest {
    private static final float SCALE = 320.0f;
    private static final  int RESOLUTION = 2047;
    private static final  float MIN = -RESOLUTION / SCALE;
    private static final  float MAX = RESOLUTION / SCALE;

    private static final float[] lut = initLUT();

    private static float[] initLUT() {
        float[] lut = new float[RESOLUTION + 1];

        for (int i = 0; i < RESOLUTION + 1; i++) {
            lut[i] = (float)(1.0 / (1.0 + Math.exp(-i / SCALE)));
        }
        return lut;
    }

    public static float sigmoid1(double value) {
        return (float) (1.0 / (1.0 + Math.exp(-value)));
    }

    public static float sigmoid2(float value) {
        if (value <= MIN) return 0.0f;
        if (value >= MAX) return 1.0f;
        if (value >= 0) return lut[(int)(value * SCALE + 0.5f)];
        return 1.0f - lut[(int)(-value * SCALE + 0.5f)];
    }

    public static float error(float v0, float v1) {
        return Math.abs(v1 - v0);
    }

    public static float testError() {
        float emax = 0.0f;
        for (float x = -10.0f; x < 10.0f; x+= 0.00001f) {
            float v0 = sigmoid1(x);
            float v1 = sigmoid2(x);
            float e = error(v0, v1);
            if (e > emax) emax = e;
        }
        return emax;
    }

    public static long sigmoid1Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                y = sigmoid1(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("",y);
        return t1 - t0;
    }    

    public static long sigmoid2Perf() {
        float y = 0.0f;
        long t0 = System.currentTimeMillis();
        for (int i = 0; i < 10; i++) {
            for (float x = -5.0f; x < 5.0f; x+= 0.00001f) {
                y = sigmoid2(x);
            }
        }
        long t1 = System.currentTimeMillis();
        System.out.printf("",y);
        return t1 - t0;
    }    

    public static void main(String[] args) {

        System.out.printf("Max deviation is %f\n", testError());
        System.out.printf("10^7 iterations using sigmoid1() took %d ms\n", sigmoid1Perf());
        System.out.printf("10^7 iterations using sigmoid2() took %d ms\n", sigmoid2Perf());
    }
}

I realize that it has been a year since this question popped up, but I ran across it because of the discussion of F# and C performance relative to C#. I played with some of the samples from other responders and discovered that delegates appear to execute faster than a regular method invocation, but there is no apparent performance advantage to F# over C#.

  • C: 166ms
  • C# (delegate): 275ms
  • C# (method): 431ms
  • C# (method, float counter): 2,656ms
  • F#: 404ms

The C# with a float counter was a straight port of the C code. It is much faster to use an int in the for loop.

There are a lot of good answers here. I would suggest running it through this technique, just to make sure

  • You're not calling it any more times than you need to.
    (Sometimes functions get called more than necessary, just because they are so easy to call.)
  • You're not calling it repeatedly with the same arguments
    (where you could use memoization)

BTW the function you have is the inverse logit function,
or the inverse of the log-odds-ratio function log(f/(1-f)).
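For anyone who wants the algebra behind that remark: solving the log-odds expression for f gives back the sigmoid from the question. Here y is just a name I'm giving to the log-odds value; it is not from the original post.

```latex
\begin{aligned}
y &= \log\frac{f}{1-f} \\
e^{y} &= \frac{f}{1-f} \\
f &= \frac{e^{y}}{1+e^{y}} = \frac{1}{1+e^{-y}}
\end{aligned}
```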

(Updated with performance measurements) (Updated again with real results :)

I think a lookup table solution would get you very far when it comes to performance, at a negligible memory and precision cost.

The following snippet is an example implementation in C (I don't speak C# fluently enough to dry-code it). It runs and performs well enough, but I'm sure there's a bug in it :)

#include <math.h>
#include <stdio.h>
#include <time.h>

#define SCALE 320.0f
#define RESOLUTION 2047
#define MIN (-RESOLUTION / SCALE) /* parenthesized so the macros expand safely inside larger expressions */
#define MAX (RESOLUTION / SCALE)

static float sigmoid_lut[RESOLUTION + 1];

void init_sigmoid_lut(void) {
    int i;    
    for (i = 0; i < RESOLUTION + 1; i++) {
        sigmoid_lut[i] =  (1.0 / (1.0 + exp(-i / SCALE)));
    }
}

static float sigmoid1(const float value) {
    return (1.0f / (1.0f + expf(-value)));
}

static float sigmoid2(const float value) {
    if (value <= MIN) return 0.0f;
    if (value >= MAX) return 1.0f;
    if (value >= 0) return sigmoid_lut[(int)(value * SCALE + 0.5f)];
    return 1.0f-sigmoid_lut[(int)(-value * SCALE + 0.5f)];
}

float test_error() {
    float x;
    float emax = 0.0;

    for (x = -10.0f; x < 10.0f; x+=0.00001f) {
        float v0 = sigmoid1(x);
        float v1 = sigmoid2(x);
        float error = fabsf(v1 - v0);
        if (error > emax) { emax = error; }
    } 
    return emax;
}

int sigmoid1_perf() {
    clock_t t0, t1;
    int i;
    float x, y = 0.0f;

    t0 = clock();
    for (i = 0; i < 10; i++) {
        for (x = -5.0f; x <= 5.0f; x+=0.00001f) {
            y = sigmoid1(x);
        }
    }
    t1 = clock();
    printf("", y); /* To avoid sigmoidX() calls being optimized away */
    return (t1 - t0) / (CLOCKS_PER_SEC / 1000);
}

int sigmoid2_perf() {
    clock_t t0, t1;
    int i;
    float x, y = 0.0f;
    t0 = clock();
    for (i = 0; i < 10; i++) {
        for (x = -5.0f; x <= 5.0f; x+=0.00001f) {
            y = sigmoid2(x);
        }
    }
    t1 = clock();
    printf("", y); /* To avoid sigmoidX() calls being optimized away */
    return (t1 - t0) / (CLOCKS_PER_SEC / 1000);
}

int main(void) {
    init_sigmoid_lut();
    printf("Max deviation is %0.6f\n", test_error());
    printf("10^7 iterations using sigmoid1: %d ms\n", sigmoid1_perf());
    printf("10^7 iterations using sigmoid2: %d ms\n", sigmoid2_perf());

    return 0;
}

Previous results were due to the optimizer doing its job and optimizing away the calculations. Making it actually execute the code yields slightly different and much more interesting results (on my way slow MB Air):

$ gcc -O2 test.c -o test && ./test
Max deviation is 0.001664
10^7 iterations using sigmoid1: 571 ms
10^7 iterations using sigmoid2: 113 ms


TODO:

There are things to improve and ways to remove weaknesses; how to do it is left as an exercise to the reader :)

  • Tune the range of the function to avoid the jump where the table starts and ends.
  • Add a slight noise function to hide the aliasing artifacts.
  • As Rex said, interpolation could get you quite a bit further precision-wise while being rather cheap performance-wise.
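As one possible take on the interpolation item, here is a Java sketch that reuses the SCALE and RESOLUTION values from the snippet above and blends the two neighbouring table entries instead of rounding to the nearest one. The class name is hypothetical, and it deliberately leaves the small jump at the table edge (the first TODO item) untouched.

```java
/** Sketch of a sigmoid lookup table with linear interpolation; names are illustrative only. */
final class InterpolatedLut {
    private static final float SCALE = 320.0f;
    private static final int RESOLUTION = 2047;
    private static final float[] LUT = new float[RESOLUTION + 1];

    static {
        // Table covers x in [0, RESOLUTION / SCALE]; negative inputs use symmetry.
        for (int i = 0; i <= RESOLUTION; i++) {
            LUT[i] = (float) (1.0 / (1.0 + Math.exp(-i / (double) SCALE)));
        }
    }

    static float sigmoid(float value) {
        float pos = Math.abs(value) * SCALE;   // fractional table position
        float s;
        if (pos >= RESOLUTION) {
            s = 1.0f;                          // beyond the table: saturate
        } else {
            int i = (int) pos;                 // lower neighbour, at most RESOLUTION - 1
            float frac = pos - i;
            s = LUT[i] + frac * (LUT[i + 1] - LUT[i]);
        }
        return value >= 0 ? s : 1.0f - s;      // sigmoid(-x) = 1 - sigmoid(x)
    }
}
```

The cost over the plain lookup is one extra table read, a subtraction and a multiply-add, so it is worth timing against the nearest-entry version before adopting it.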

You might also consider experimenting with alternative activation functions which are cheaper to evaluate. For example:

f(x) = (3x - x**3)/2

(which could be factored as

f(x) = x*(3 - x*x)/2

for one less multiplication). This function has odd symmetry, and its derivative is trivial. Using it for a neural network requires normalizing the sum-of-inputs by dividing by the total number of inputs (limiting the domain to [-1..1], which is also the range).
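A small Java sketch of that function, its derivative f'(x) = 3(1 - x^2)/2, and the normalization step is below. The names are invented, and the normalization only keeps the argument inside [-1, 1] under the assumption that every weighted input is itself in that range.

```java
/** Sketch of the cubic activation f(x) = x(3 - x^2)/2; names are illustrative only. */
final class CubicActivation {
    /** f(x) = x * (3 - x*x) / 2, meant for x already inside [-1, 1]. */
    static float f(float x) {
        return x * (3.0f - x * x) * 0.5f;
    }

    /** f'(x) = 3 * (1 - x*x) / 2, which is what backpropagation needs. */
    static float derivative(float x) {
        return 1.5f * (1.0f - x * x);
    }

    /** Divides the weighted sum by the number of inputs before applying f. */
    static float activate(float weightedSum, int inputCount) {
        return f(weightedSum / inputCount);
    }
}
```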

A mild variation on Soprano's theme:

public static float Sigmoid(double value) {
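    float v = (float) value;       // C# needs an explicit cast to narrow double to float
    float k = (float) Math.Exp(v); // Math.Exp takes and returns double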
    return k / (1.0f + k);
}

Since you're only after a single precision result, why make the Math.Exp function calculate a double? Any exponent calculator that uses an iterative summation (see the expansion of e^x) will take longer for more precision, each and every time. And double is twice the work of single! This way, you convert to single first, then do your exponential.

But the expf function should be faster still. I don't see the need for soprano's (float) cast in passing to expf though, unless C# doesn't do implicit float-double conversion.

Otherwise, just use a real language, like FORTRAN...

Doing a Google search, I found an alternative implementation of the Sigmoid function.

public double Sigmoid(double x)
{
   return 2 / (1 + Math.Exp(-2 * x)) - 1;
}

Is that correct for your needs? Is it faster?

http://dynamicnotions.blogspot.com/2008/09/sigmoid-function-in-c.html
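On the "is that correct" part, a quick bit of algebra shows that this alternative is the hyperbolic tangent, a rescaled and shifted logistic function with range (-1, 1) rather than (0, 1). So it is not a drop-in replacement for the function in the question unless the rest of the network expects a zero-centered activation. Here sigma denotes the logistic sigmoid from the question.

```latex
\frac{2}{1+e^{-2x}} - 1 = \frac{1 - e^{-2x}}{1+e^{-2x}} = \tanh(x) = 2\,\sigma(2x) - 1,
\qquad \sigma(x) = \frac{1}{1+e^{-x}}
```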

1) Do you call this from only one place? If so, you may gain a small amount of performance by moving the code out of that function and just putting it right where you would normally have called the Sigmoid function. I don't like this idea in terms of code readability and organization, but when you need to get every last performance gain, this might help because I think function calls require a push/pop of registers on the stack, which could be avoided if your code was all inline.

2) I have no idea if this might help, but try making your function parameter a ref parameter. See if it's faster. I would have suggested making it const (which would have been an optimization if this were in C++) but C# doesn't support const parameters.

If you need a giant speed boost, you could probably look into parallelizing the function using the (ge)force. IOW, use DirectX to control the graphics card into doing it for you. I have no idea how to do this, but I've seen people use graphics cards for all kinds of calculations.

I've seen that a lot of people around here are trying to use approximation to make Sigmoid faster. However, it is important to know that Sigmoid can also be expressed using tanh, not only exp. Calculating Sigmoid this way is around 5 times faster than with the exponential, and by using this method you are not approximating anything, thus the original behaviour of Sigmoid is kept as-is.

    public static double Sigmoid(double value)
    {
        return 0.5d + 0.5d * Math.Tanh(value/2);
    }

Of course, parallelization would be the next step to performance improvement, but as far as the raw calculation is concerned, using Math.Tanh is faster than Math.Exp.
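The identity itself is easy to confirm numerically. The Java sketch below (names invented) evaluates both forms over a range and reports the largest difference, which should be at rounding-error level. It says nothing about speed; that needs to be timed on your own runtime.

```java
/** Sketch comparing the exp and tanh forms of the sigmoid; names are illustrative only. */
final class TanhSigmoid {
    static double viaExp(double x) {
        return 1.0 / (1.0 + Math.exp(-x));
    }

    static double viaTanh(double x) {
        return 0.5 + 0.5 * Math.tanh(x / 2.0);
    }

    /** Largest absolute difference between the two forms on [from, to]. */
    static double maxDifference(double from, double to, double step) {
        int n = (int) Math.round((to - from) / step);
        double max = 0.0;
        for (int i = 0; i <= n; i++) {
            double x = from + i * step;  // int counter avoids accumulating step error
            max = Math.max(max, Math.abs(viaExp(x) - viaTanh(x)));
        }
        return max;
    }
}
```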

Remember, Sigmoid constrains results to range between 0 and 1. Values smaller than about -10 return a value very, very close to 0.0. Values greater than about 10 return a value very, very close to 1.

Back in the old days when computers couldn't handle arithmetic overflow/underflow that well, putting in if conditions to limit the calculation was usual. If I were really concerned about its performance (or basically Math's performance), I would change your code to the old-fashioned way (and mind the limits) so that it does not call Math unnecessarily:

public double Sigmoid(double value)
{
    if (value < -45.0) return 0.0;
    if (value > 45.0) return 1.0;
    return 1.0 / (1.0 + Math.Exp(-value));
}

I realize anyone reading this answer may be involved in some sort of NN development. Be mindful of how the above affects the other aspects of your training scores.
