How can I quickly subtract one ushort array from another in C#?

Question

I need to quickly subtract each value in ushort arrayA from the value at the corresponding index in ushort arrayB, which has an identical length.

In addition, if the difference is negative, I need to store a zero, not the negative difference.

(Length = 327680 to be exact, since I'm subtracting a 640x512 image from another image of identical size.)

The code below currently takes ~20ms and I'd like to get it down under ~5ms if possible. Unsafe code is OK, but please provide an example, as I'm not super-skilled at writing unsafe code.

Thank you!

public ushort[] Buffer { get; set; }

public void SubtractBackgroundFromBuffer(ushort[] backgroundBuffer)
{
    System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
    sw.Start();

    int bufferLength = Buffer.Length;

    for (int index = 0; index < bufferLength; index++)
    {
        int difference = Buffer[index] - backgroundBuffer[index];

        if (difference >= 0)
            Buffer[index] = (ushort)difference;
        else
            Buffer[index] = 0;
    }

    Debug.WriteLine("SubtractBackgroundFromBuffer(ms): " + sw.Elapsed.TotalMilliseconds.ToString("N2"));
}

UPDATE: While it's not strictly C#, for the benefit of others who read this, I ended up adding a C++ CLR Class Library to my solution with the following code. It runs in ~3.1ms. If an unmanaged C++ library is used, it runs in ~2.2ms. I decided to go with the managed library since the time difference was small.

// SpeedCode.h
#pragma once
using namespace System;

namespace SpeedCode
{
    public ref class SpeedClass
    {
        public:
            static void SpeedSubtractBackgroundFromBuffer(array<UInt16> ^ buffer, array<UInt16> ^ backgroundBuffer, int bufferLength);
    };
}

// SpeedCode.cpp
// This is the main DLL file.
#include "stdafx.h"
#include "SpeedCode.h"

namespace SpeedCode
{
    void SpeedClass::SpeedSubtractBackgroundFromBuffer(array<UInt16> ^ buffer, array<UInt16> ^ backgroundBuffer, int bufferLength)
    {
        for (int index = 0; index < bufferLength; index++)
        {
            buffer[index] = (UInt16)((buffer[index] - backgroundBuffer[index]) * (buffer[index] > backgroundBuffer[index]));
        }
    }
}

Then I call it like this:

    public void SubtractBackgroundFromBuffer(ushort[] backgroundBuffer)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        SpeedCode.SpeedClass.SpeedSubtractBackgroundFromBuffer(Buffer, backgroundBuffer, Buffer.Length);

        Debug.WriteLine("SubtractBackgroundFromBuffer(ms): " + sw.Elapsed.TotalMilliseconds.ToString("N2"));
    }

Answer 1

Some benchmarks.

1. SubtractBackgroundFromBuffer: this is the original method from the question.
2. SubtractBackgroundFromBufferWithCalcOpt: this is the original method augmented with TTat's idea for improving the calculation speed.
3. SubtractBackgroundFromBufferParallelFor: the solution from Selman22's answer.
4. SubtractBackgroundFromBufferBlockParallelFor: my answer. Similar to 3., but breaks the processing up into blocks of 4096 values.
5. SubtractBackgroundFromBufferPartitionedParallelForEach: Geoff's first answer.
6. SubtractBackgroundFromBufferPartitionedParallelForEachHack: Geoff's second answer.

Updates

Interestingly, I can get a small speed increase (~6%) for SubtractBackgroundFromBufferBlockParallelFor by using (as suggested by Bruno Costa)

Buffer[i] = (ushort)Math.Max(difference, 0);

instead of

if (difference >= 0)
    Buffer[i] = (ushort)difference;
else
    Buffer[i] = 0;

Results

Note that this is the total time for the 1000 iterations in each run.

SubtractBackgroundFromBuffer(ms):                                 2,062.23 
SubtractBackgroundFromBufferWithCalcOpt(ms):                      2,245.42
SubtractBackgroundFromBufferParallelFor(ms):                      4,021.58
SubtractBackgroundFromBufferBlockParallelFor(ms):                   769.74
SubtractBackgroundFromBufferPartitionedParallelForEach(ms):         827.48
SubtractBackgroundFromBufferPartitionedParallelForEachHack(ms):     539.60

So it seems from those results that the best approach makes use of Parallel.For to operate on chunks of the image. Your mileage will of course vary, and the performance of parallel code is sensitive to the CPU you are running on.

Test Harness

I ran this for each method in Release mode. I start and stop the Stopwatch this way to ensure that only processing time is measured.

System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
ushort[] bgImg = GenerateRandomBuffer(327680, 818687447);

for (int i = 0; i < 1000; i++)
{
    Buffer = GenerateRandomBuffer(327680, 128011992);                

    sw.Start();
    SubtractBackgroundFromBuffer(bgImg);
    sw.Stop();
}

Console.WriteLine("SubtractBackgroundFromBuffer(ms): " + sw.Elapsed.TotalMilliseconds.ToString("N2"));


public static ushort[] GenerateRandomBuffer(int size, int randomSeed)
{
    // Allocate from the size parameter rather than a hard-coded length.
    ushort[] buffer = new ushort[size];
    Random random = new Random(randomSeed);

    for (int i = 0; i < size; i++)
    {
        buffer[i] = (ushort)random.Next(ushort.MinValue, ushort.MaxValue);
    }

    return buffer;
}
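
The start/stop pattern above can be wrapped in a small helper so the same harness is reused for every method. This is only a sketch: the `BenchmarkSketch` class, the `Time` method and its parameter names are my own illustrative names, and the untimed warm-up call is an assumption that was not part of the original harness.

```csharp
using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
    // Hypothetical helper: runs `prepare` untimed and `work` timed, `iterations` times,
    // and returns the accumulated time spent in `work`, in milliseconds.
    public static double Time(Action prepare, Action work, int iterations)
    {
        // One untimed warm-up pass so JIT compilation is not counted (an assumption).
        prepare();
        work();

        var sw = new Stopwatch();
        for (int i = 0; i < iterations; i++)
        {
            prepare();   // e.g. regenerate Buffer; not timed
            sw.Start();  // Start() after Stop() resumes accumulating elapsed time
            work();      // e.g. SubtractBackgroundFromBuffer(bgImg)
            sw.Stop();
        }
        return sw.Elapsed.TotalMilliseconds;
    }
}
```

It could then be called as `BenchmarkSketch.Time(() => Buffer = GenerateRandomBuffer(327680, 128011992), () => SubtractBackgroundFromBuffer(bgImg), 1000)`.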

The Methods

public static void SubtractBackgroundFromBuffer(ushort[] backgroundBuffer)
{
    int bufferLength = Buffer.Length;

    for (int index = 0; index < bufferLength; index++)
    {
        int difference = Buffer[index] - backgroundBuffer[index];

        if (difference >= 0)
            Buffer[index] = (ushort)difference;
        else
            Buffer[index] = 0;
    }
}

public static void SubtractBackgroundFromBufferWithCalcOpt(ushort[] backgroundBuffer)
{
    int bufferLength = Buffer.Length;

    for (int index = 0; index < bufferLength; index++)
    {
        if (Buffer[index] < backgroundBuffer[index])
        {
            Buffer[index] = 0;
        }
        else
        {
            Buffer[index] -= backgroundBuffer[index];
        }
    }
}

public static void SubtractBackgroundFromBufferParallelFor(ushort[] backgroundBuffer)
{
    Parallel.For(0, Buffer.Length, (i) =>
    {
        int difference = Buffer[i] - backgroundBuffer[i];
        if (difference >= 0)
            Buffer[i] = (ushort)difference;
        else
            Buffer[i] = 0;
    });
}        

public static void SubtractBackgroundFromBufferBlockParallelFor(ushort[] backgroundBuffer)
{
    int blockSize = 4096;

    Parallel.For(0, (int)Math.Ceiling(Buffer.Length / (double)blockSize), (j) =>
    {
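        // Clamp the last block so a length that is not a multiple of blockSize cannot overrun.
        int end = Math.Min((j + 1) * blockSize, Buffer.Length);

        for (int i = j * blockSize; i < end; i++)
        {
            int difference = Buffer[i] - backgroundBuffer[i];

            Buffer[i] = (ushort)Math.Max(difference, 0);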
        }
    });
}

public static void SubtractBackgroundFromBufferPartitionedParallelForEach(ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; ++i)
            {
                if (Buffer[i] < backgroundBuffer[i])
                {
                    Buffer[i] = 0;
                }
                else
                {
                    Buffer[i] -= backgroundBuffer[i];
                }
            }
        });
}

public static void SubtractBackgroundFromBufferPartitionedParallelForEachHack(ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; ++i)
        {
            unsafe
            {
                var nonNegative = Buffer[i] > backgroundBuffer[i];
                Buffer[i] = (ushort)((Buffer[i] - backgroundBuffer[i]) *
                    *((int*)(&nonNegative)));
            }
        }
    });
}

Answer 2

This is an interesting question.

Only performing the subtraction after testing that the result won't be negative (as suggested by TTat and Maximum Cookie) has negligible impact, as this optimization can already be performed by the JIT compiler.

Parallelizing the task (as suggested by Selman22) is a good idea, but when the loop is as fast as it is in this case, the overhead ends up outweighing the gains, so Selman22's implementation actually runs slower in my testing. I suspect that nick_w's benchmarks were produced with the debugger attached, hiding this fact.

Parallelizing the task in larger chunks (as suggested by nick_w) deals with the overhead problem, and can actually produce faster performance, but you don't have to calculate the chunks yourself. You can use Partitioner to do this for you:

public static void SubtractBackgroundFromBufferPartitionedParallelForEach(
    ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; ++i)
            {
                if (Buffer[i] < backgroundBuffer[i])
                {
                    Buffer[i] = 0;
                }
                else
                {
                    Buffer[i] -= backgroundBuffer[i];
                }
            }
        });
}

The above method consistently outperforms nick_w's hand-rolled chunking in my testing.
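
If you do want some control over the chunk size without hand-rolling the index arithmetic, `Partitioner.Create` also has an overload that takes a range size. The sketch below assumes the arrays are passed in as parameters (the original uses a `Buffer` property), and `PartitionSketch`, `Subtract` and `rangeSize` are illustrative names. I have not benchmarked it, so treat the range size as a knob to experiment with rather than a recommendation.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class PartitionSketch
{
    // Subtracts background from buffer in place, storing 0 where the difference would be negative.
    // rangeSize is the number of elements handed to each work item.
    public static void Subtract(ushort[] buffer, ushort[] background, int rangeSize)
    {
        Parallel.ForEach(Partitioner.Create(0, buffer.Length, rangeSize), range =>
        {
            // range.Item1 is inclusive, range.Item2 is exclusive.
            for (int i = range.Item1; i < range.Item2; ++i)
            {
                int difference = buffer[i] - background[i];
                buffer[i] = (ushort)Math.Max(difference, 0);
            }
        });
    }
}
```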

But wait! There's more to it than that.

The real culprit in slowing down your code is not the assignment or arithmetic. It's the if statement. How much it affects performance depends heavily on the nature of the data you are processing.

nick_w's benchmarking generates random data of the same magnitude for both buffers. However, I suspect that it is very likely that you actually have lower average magnitude data in the background buffer. This detail can be significant due to branch prediction (as explained in this classic SO answer).

When the value in the background buffer is usually smaller than that in the buffer, the CPU's branch predictor can pick up on this and favor that branch accordingly. When the data in each buffer is from the same random population, there is no way to guess the outcome of the if statement with greater than 50% accuracy. It is this latter scenario that nick_w is benchmarking, and under those conditions we could potentially further optimize your method by using unsafe code to convert a bool to an integer and avoid branching at all. (Note that the following code relies on an implementation detail of how bools are represented in memory, and while it works for your scenario in .NET 4.5, it is not necessarily a good idea, and is shown here for illustrative purposes.)

public static void SubtractBackgroundFromBufferPartitionedParallelForEachHack(
    ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; ++i)
            {
                unsafe
                {
                    var nonNegative = Buffer[i] > backgroundBuffer[i];
                    Buffer[i] = (ushort)((Buffer[i] - backgroundBuffer[i]) *
                        *((int*)(&nonNegative)));
                }
            }
        });
}
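
For comparison, the branch can also be avoided without unsafe code by building a mask from the sign bit of the difference. This is a different technique from the bool-to-int trick above, not a rewrite of it, and I have not timed it against the other methods. `BranchlessSketch` and `SaturatingSubtract` are names made up for this sketch.

```csharp
public static class BranchlessSketch
{
    // Returns minuend - subtrahend, or 0 if that difference would be negative.
    public static ushort SaturatingSubtract(ushort minuend, ushort subtrahend)
    {
        // Both operands are promoted to int, so the difference is in -65535..65535.
        int difference = minuend - subtrahend;

        // Arithmetic shift copies the sign bit: -1 (all ones) if negative, 0 otherwise.
        // Inverting gives a mask of 0 for negative differences and all ones otherwise.
        int mask = ~(difference >> 31);

        return (ushort)(difference & mask);
    }
}
```

Inside the partitioned loop this would be used as `Buffer[i] = BranchlessSketch.SaturatingSubtract(Buffer[i], backgroundBuffer[i]);`. Whether it beats the plain if statement will depend on the data, for the branch prediction reasons already discussed.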

If you are really looking to shave a bit more time off, then you can follow this approach in a safer manner by switching language to C++/CLI, as that will let you use a boolean in an arithmetic expression without resorting to unsafe code:

UInt16 MyCppLib::Maths::SafeSubtraction(UInt16 minuend, UInt16 subtrahend)
{
    return (UInt16)((minuend - subtrahend) * (minuend > subtrahend));
}

You can create a purely managed DLL using C++/CLI exposing the above static method, and then use it in your C# code:

public static void SubtractBackgroundFromBufferPartitionedParallelForEachCpp(
    ushort[] backgroundBuffer)
{
    Parallel.ForEach(Partitioner.Create(0, Buffer.Length), range =>
    {
        for (int i = range.Item1; i < range.Item2; ++i)
        {
            Buffer[i] = 
                MyCppLib.Maths.SafeSubtraction(Buffer[i], backgroundBuffer[i]);
        }
    });
}

This outperforms the hacky unsafe C# code above. In fact, it is so fast that you could write the whole method in C++/CLI, forget about parallelization, and it would still outperform the other techniques.

Using nick_w's test harness, the above method outperforms all of the other suggestions published here so far. Here are the results I get (1-4 are the cases he tried, and 5-7 are the ones outlined in this answer):

1. SubtractBackgroundFromBuffer(ms):                               2,021.37
2. SubtractBackgroundFromBufferWithCalcOpt(ms):                    2,125.80
3. SubtractBackgroundFromBufferParallelFor(ms):                    3,431.58
4. SubtractBackgroundFromBufferBlockParallelFor(ms):               1,401.36
5. SubtractBackgroundFromBufferPartitionedParallelForEach(ms):     1,197.76
6. SubtractBackgroundFromBufferPartitionedParallelForEachHack(ms):   742.72
7. SubtractBackgroundFromBufferPartitionedParallelForEachCpp(ms):    499.27

However, in the scenario I expect you actually have, where background values are typically smaller, successful branch prediction improves results across the board, and the 'hack' to avoid the if statement is actually slower.

Here are the results I get using nick_w's test harness when I restrict values in the background buffer to the range 0-6500 (about 10% of the range used for the buffer):

1. SubtractBackgroundFromBuffer(ms):                                 773.50
2. SubtractBackgroundFromBufferWithCalcOpt(ms):                      915.91
3. SubtractBackgroundFromBufferParallelFor(ms):                    2,458.36
4. SubtractBackgroundFromBufferBlockParallelFor(ms):                 663.76
5. SubtractBackgroundFromBufferPartitionedParallelForEach(ms):       658.05
6. SubtractBackgroundFromBufferPartitionedParallelForEachHack(ms):   762.11
7. SubtractBackgroundFromBufferPartitionedParallelForEachCpp(ms):    494.12

You can see that results 1-5 have all dramatically improved, since they are now benefiting from better branch prediction. Results 6 & 7 haven't changed much, since they avoid branching.

This change in data completely changes things. In this scenario, even the fastest all-C# solution is now only 15% faster than your original code.

Bottom line: be sure to test any method you pick with representative data, or your results will be meaningless.
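
To try the restricted-range scenario yourself, the generator in the harness can be given an upper bound. The version below is a guess at how that might look, not the code I actually ran: `TestDataSketch` and the `maxExclusive` parameter are names introduced for this sketch.

```csharp
using System;

public static class TestDataSketch
{
    // Hypothetical variant of GenerateRandomBuffer: values are drawn from 0 (inclusive)
    // up to maxExclusive (exclusive), which must not exceed 65536.
    public static ushort[] GenerateRandomBuffer(int size, int randomSeed, int maxExclusive)
    {
        var buffer = new ushort[size];
        var random = new Random(randomSeed);

        for (int i = 0; i < size; i++)
        {
            buffer[i] = (ushort)random.Next(0, maxExclusive);
        }

        return buffer;
    }
}
```

For example, `TestDataSketch.GenerateRandomBuffer(327680, 818687447, 6500)` would produce a low-magnitude background buffer, while the image buffer keeps the full range. Real captured frames would be a better test than either.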

Answer 3

You can try Parallel.For:

Parallel.For(0, Buffer.Length, (i) =>
{
    int difference = Buffer[i] - backgroundBuffer[i];
    if (difference >= 0)
        Buffer[i] = (ushort)difference;
    else
        Buffer[i] = 0;
});

Update: I have tried it, and I see there is only a minimal difference in your case. But as the array gets bigger, the difference gets bigger too.


Answer 4

You may get a minor performance increase by checking first if the result is going to be negative before actually performing the subtraction. That way, there is no need to perform the subtraction if the result will be negative. Example:

if (Buffer[index] > backgroundBuffer[index])
    Buffer[index] = (ushort)(Buffer[index] - backgroundBuffer[index]);
else
    Buffer[index] = 0;

Answer 5

Here's a solution that uses Zip():

Buffer = Buffer.Zip<ushort, ushort, ushort>(backgroundBuffer, (x, y) =>
{
    return (ushort)Math.Max(0, x - y);
}).ToArray();

It doesn't perform as well as the other answers, but it's definitely the shortest solution.

Answer 6

What about,

Enumerable.Range(0, Buffer.Length).AsParallel().ForAll(i =>
    {
        unsafe
        {
            var nonNegative = Buffer[i] > backgroundBuffer[i];
            Buffer[i] = (ushort)((Buffer[i] - backgroundBuffer[i]) *
                *((int*)(&nonNegative)));
        }
    });

6 answers

Answer 1
4, accepted, 2014-01-16 01:53:41

Answer 2
4, 2014-01-16 08:51:27

Answer 3
1, 2014-01-16 00:59:45

Answer 4
1, 2014-01-16 01:14:43

Answer 5
0, 2014-01-16 10:35:34

Answer 6
0, 2014-01-16 10:58:20
