How can I make this C# loop faster?

Question

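Executive summary: Reed's answer below is the fastest solution if you want to stay in C#. If you are willing to marshal to C++ (which I am), that is a faster solution.

I have two 55 MB ushort arrays in C#. I am combining them using the following loop: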

float b = (float)number / 100.0f;
for (int i = 0; i < length; i++)
{
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
}

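According to DateTime.Now calls added before and after the loop, this code takes 3.5 seconds to run. How can I make it faster?

EDIT: Here is some code that, I think, shows the root of the problem. When the following code is run in a brand-new WPF application, I get these timing results: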

Time elapsed: 00:00:00.4749156 //arrays added directly
Time elapsed: 00:00:00.5907879 //arrays contained in another class
Time elapsed: 00:00:02.8856150 //arrays accessed via accessor methods

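So when the arrays are walked directly, the loop is much faster than when the arrays live inside another object or container. This suggests that, somewhere in my real code, I am going through an accessor method rather than accessing the arrays directly. Even so, the fastest time I seem to be able to get is about half a second. When I run the second code listing in C++ with icc, I get: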

Run time for pointer walk: 0.0743338

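In this case, C++ is roughly 7x faster (using icc; I am not sure whether msvc can reach the same performance, since I am less familiar with its optimizations). Is there any way to get C# near C++ performance levels, or should I just have C# call my C++ routine?

Listing 1, C# code: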
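The accessor-method slowdown measured above can often be avoided without restructuring the class: call each accessor once before the loop and index local references inside it. The following is a minimal sketch under stated assumptions, not the asker's code. `Holder` is a hypothetical stand-in for the ArrayHolder class in Listing 1, and `HoistedBlend.Run` is an illustrative helper name.

```csharp
using System;

// Hypothetical stand-in for ArrayHolder from Listing 1 (accessors only).
public class Holder
{
    private readonly ushort[] output, input1, input2;

    public Holder(int length)
    {
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] GetOutput() { return output; }
    public ushort[] GetInput1() { return input1; }
    public ushort[] GetInput2() { return input2; }
}

public static class HoistedBlend
{
    // Illustrative helper: each accessor is called once, outside the loop.
    public static void Run(Holder holder, int number)
    {
        float b = number / 100.0f;
        ushort[] dst = holder.GetOutput();
        ushort[] src1 = holder.GetInput1();
        ushort[] src2 = holder.GetInput2();

        // Looping to dst.Length on a local array reference is the pattern
        // the JIT can typically optimize best.
        for (int i = 0; i < dst.Length; i++)
        {
            dst[i] = (ushort)(src1[i] + (ushort)(b * src2[i]));
        }
    }
}
```

Whether this closes the whole gap depends on the build: the Release-mode timings reported later in the question show a much smaller difference between direct and accessor access than the debugger-attached run does.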

public class ArrayHolder
{
    int length;
    public ushort[] output;
    public ushort[] input1;
    public ushort[] input2;
    public ArrayHolder(int inLength)
    {
        length = inLength;
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] getOutput() { return output; }
    public ushort[] getInput1() { return input1; }
    public ushort[] getInput2() { return input2; }
}


/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();


        Random random = new Random();

        int length = 55 * 1024 * 1024;
        ushort[] output = new ushort[length];
        ushort[] input1 = new ushort[length];
        ushort[] input2 = new ushort[length];

        ArrayHolder theArrayHolder = new ArrayHolder(length);

        for (int i = 0; i < length; i++)
        {
            output[i] = (ushort)random.Next(0, 16384);
            input1[i] = (ushort)random.Next(0, 16384);
            input2[i] = (ushort)random.Next(0, 16384);
            theArrayHolder.getOutput()[i] = output[i];
            theArrayHolder.getInput1()[i] = input1[i];
            theArrayHolder.getInput2()[i] = input2[i];
        }

        Stopwatch stopwatch = new Stopwatch(); 
        stopwatch.Start();
        int number = 44;
        float b = (float)number / 100.0f;
        for (int i = 0; i < length; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        } 
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.output[i] =
                (ushort)(theArrayHolder.input1[i] +
                (ushort)(b * (float)theArrayHolder.input2[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.getOutput()[i] =
                (ushort)(theArrayHolder.getInput1()[i] +
                (ushort)(b * (float)theArrayHolder.getInput2()[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
    }
}

Listing 2, the C++ equivalent:

// looptiming.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <stdio.h>
#include <iostream>


int _tmain(int argc, _TCHAR* argv[])
{

    int length = 55*1024*1024;
    unsigned short* output = new unsigned short[length];
    unsigned short* input1 = new unsigned short[length];
    unsigned short* input2 = new unsigned short[length];
    unsigned short* outPtr = output;
    unsigned short* in1Ptr = input1;
    unsigned short* in2Ptr = input2;
    int i;
    const int max = 16384;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = rand()%max;
        *in1Ptr = rand()%max;
        *in2Ptr = rand()%max;
    }

    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick1, tick2;   // A point in time
    LARGE_INTEGER time;   // For converting tick into real time


    QueryPerformanceCounter(&tick1);

    outPtr = output;
    in1Ptr = input1;
    in2Ptr = input2;
    int number = 44;
    float b = (float)number/100.0f;


    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = *in1Ptr + (unsigned short)((float)*in2Ptr * b);
    }
    QueryPerformanceCounter(&tick2);
    QueryPerformanceFrequency(&ticksPerSecond);

    time.QuadPart = tick2.QuadPart - tick1.QuadPart;

    std::cout << "Run time for pointer walk: " << (double)time.QuadPart/(double)ticksPerSecond.QuadPart << std::endl;

    return 0;
}

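EDIT 2: Enabling /QxHost in the second example drops the time to 0.0662714 seconds. Modifying the first loop as @Reed suggested gets me down to

Time elapsed: 00:00:00.3835017

So it is still not fast enough to drive a slider. That time was obtained with this code: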

        stopwatch.Start();
        Parallel.ForEach(Partitioner.Create(0, length),
         (range) =>
         {
             for (int i = range.Item1; i < range.Item2; i++)
             {
                 output[i] =
                     (ushort)(input1[i] +
                     (ushort)(b * (float)input2[i]));
             }
         });

        stopwatch.Stop();

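EDIT 3: As @Eric Lippert suggested, I reran the C# code in a Release build and, rather than running with a debugger attached, just printed the results to a dialog. They are:

Simple arrays: ~0.273s
Contained arrays: ~0.330s
Accessor arrays: ~0.345s
Parallel arrays: ~0.190s

(These numbers are averages over 5 runs.)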

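So the parallel solution is definitely faster than the 3.5 seconds I was getting before, but it is still slower than the 0.074 seconds achieved by the icc-compiled C++ code. It seems, therefore, that the fastest solution is to compile in Release and then marshal to an icc-compiled C++ routine, which makes using a slider feasible here.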

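EDIT 4: Three more suggestions from @Eric Lippert: change the loop bound from length to array.Length, use doubles instead of floats, and try unsafe code.

For those three, the timings are now:

Length: ~0.274s
Doubles, not floats: ~0.290s
Unsafe: ~0.376s

So far, the parallel solution is the big winner. Although if I could do this addition in a shader, maybe I would see some kind of speedup there...

Here is the additional code: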

        stopwatch.Reset();

        stopwatch.Start();

        double b2 = ((double)number) / 100.0;
        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b2 * (double)input2[i]));
        }

        stopwatch.Stop();
        // TotalSeconds keeps leading zeros in the fraction (e.g. 0.005 rather than "0.5").
        DoubleArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        stopwatch.Reset();

        stopwatch.Start();

        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * input2[i]));
        }

        stopwatch.Stop();
        // TotalSeconds keeps leading zeros in the fraction (e.g. 0.005 rather than "0.5").
        LengthArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        unsafe
        {
            fixed (ushort* outPtr = output, in1Ptr = input1, in2Ptr = input2){
                ushort* outP = outPtr;
                ushort* in1P = in1Ptr;
                ushort* in2P = in2Ptr;
                for (int i = 0; i < output.Length; ++i, ++outP, ++in1P, ++in2P)
                {
                    *outP = (ushort)(*in1P + b * (float)*in2P);
                }
            }
        }

        stopwatch.Stop();
        // TotalSeconds keeps leading zeros in the fraction (e.g. 0.005 rather than "0.5").
        UnsafeArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);

Answer 1

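This should be perfectly parallelizable. However, given the small amount of work done per element, you will need to handle it with extra care.

The proper way to do this (in .NET 4) is to use Parallel.ForEach in conjunction with a Partitioner: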

float b = (float)number / 100.0f;
Parallel.ForEach(Partitioner.Create(0, length), 
(range) =>
{
   for (int i = range.Item1; i < range.Item2; i++)
   {
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
   }
});

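This will efficiently partition the work across the available processing cores in your system, and should provide a decent speedup if you have multiple cores.

That being said, this will, at best, speed up the operation by a factor equal to the number of cores in your system. If you need to speed it up further, you will likely need to resort to a mix of parallelization and unsafe code. At that point, it might be worth thinking about alternatives to trying to present this in real time.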
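To make the range partitioning concrete, here is a small self-contained sketch that runs the same blend sequentially and with Partitioner.Create, so the two outputs can be compared. The class and method names (`PartitionedBlend`, `Sequential`, `Parallelized`) are illustrative assumptions, and plain arrays stand in for the image.DataArray fields in the snippet above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class PartitionedBlend
{
    // Sequential reference version of the loop from the question.
    public static void Sequential(ushort[] dst, ushort[] src1, ushort[] src2, float b)
    {
        for (int i = 0; i < dst.Length; i++)
        {
            dst[i] = (ushort)(src1[i] + (ushort)(b * src2[i]));
        }
    }

    // Range-partitioned version: each delegate call receives a
    // [Item1, Item2) block of indices and runs a plain inner loop over it,
    // so the delegate overhead is paid once per block, not once per element.
    public static void Parallelized(ushort[] dst, ushort[] src1, ushort[] src2, float b)
    {
        // Partitioner.Create(from, to) rejects an empty range, so guard it.
        if (dst.Length == 0) return;

        Parallel.ForEach(Partitioner.Create(0, dst.Length), range =>
        {
            for (int i = range.Item1; i < range.Item2; i++)
            {
                dst[i] = (ushort)(src1[i] + (ushort)(b * src2[i]));
            }
        });
    }
}
```

Because every index is written by exactly one block and no block reads another block's output, the parallel version needs no locking.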

Answer 2

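Assuming you have a lot of these elements (and you are using .NET 4), you can try to parallelize the operation:

float b = (float)number / 100.0f;  // same blend factor as in the question
Parallel.For(0, length, i=>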
   {
       image.DataArray[i] = 
      (ushort)(mUIHandler.image1.DataArray[i] + 
      (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
   });

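Of course, this all depends on whether parallelizing this is worth it. The statement looks computationally cheap, and accessing array elements by index is already fast. You might still see gains because the loop runs so many times over that much data.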
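One way to find out whether it is worth it is simply to time both variants on the real data. The sketch below is an assumption-laden illustration rather than a benchmark harness: `ParallelForCheck`, `Time`, `BlendSequential`, and `BlendParallelFor` are hypothetical names, and plain arrays replace the image.DataArray fields.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

public static class ParallelForCheck
{
    // Illustrative helper: runs the action once and returns the elapsed time.
    public static TimeSpan Time(Action action)
    {
        Stopwatch sw = Stopwatch.StartNew();
        action();
        sw.Stop();
        return sw.Elapsed;
    }

    // Plain sequential loop, used as the baseline.
    public static void BlendSequential(ushort[] dst, ushort[] src1, ushort[] src2, float b)
    {
        for (int i = 0; i < dst.Length; i++)
        {
            dst[i] = (ushort)(src1[i] + (ushort)(b * src2[i]));
        }
    }

    // Per-element Parallel.For: the delegate is invoked once per index,
    // which is the overhead this answer is worried about.
    public static void BlendParallelFor(ushort[] dst, ushort[] src1, ushort[] src2, float b)
    {
        Parallel.For(0, dst.Length, i =>
        {
            dst[i] = (ushort)(src1[i] + (ushort)(b * src2[i]));
        });
    }
}
```

A typical use would be to call each method once as a warm-up, then compare `ParallelForCheck.Time(() => ParallelForCheck.BlendSequential(dst, src1, src2, b))` with the Parallel.For equivalent, in a Release build without a debugger attached.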
