How can I make this C# loop faster?

Executive Summary: Reed's answer below is the fastest if you want to stay in C#. If you're willing to marshal to C++ (which I am), that's a faster solution.

I have two 55 MB ushort arrays in C#. I am combining them using the following loop:

float b = (float)number / 100.0f;
for (int i = 0; i < length; i++)
{
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
}

According to DateTime.Now calls added before and after it, this code takes 3.5 seconds to run. How can I make it faster?

EDIT: Here is some code that, I think, shows the root of the problem. When the following code is run in a brand new WPF application, I get these timing results:

Time elapsed: 00:00:00.4749156 //arrays added directly
Time elapsed: 00:00:00.5907879 //arrays contained in another class
Time elapsed: 00:00:02.8856150 //arrays accessed via accessor methods

So when the arrays are walked directly, the loop is much faster than when the arrays sit inside another object or container. This suggests that somewhere in my original code I'm going through an accessor method rather than accessing the arrays directly. Even so, the fastest I seem to be able to get is about half a second. When I run the second listing of code in C++ with icc, I get:

Run time for pointer walk: 0.0743338

In this case, then, C++ is roughly 7x faster (using icc; I'm not sure whether the same performance can be obtained with msvc, as I'm not as familiar with the optimizations there). Is there any way to get C# near that level of C++ performance, or should I just have C# call my C++ routine?
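One way to keep accessor calls out of the hot loop, whatever container holds the arrays, is to read each array reference into a local once and index the locals. The sketch below is only an illustration: `Holder`, `LocalHoisting`, and `BlendWithLocals` are hypothetical names, and `Holder` is a stand-in for any class that exposes its arrays through accessor methods.

```csharp
using System;

// Hypothetical stand-in for a class that hands out its arrays via accessor methods.
public sealed class Holder
{
    private readonly ushort[] output, input1, input2;

    public Holder(int length)
    {
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] GetOutput() { return output; }
    public ushort[] GetInput1() { return input1; }
    public ushort[] GetInput2() { return input2; }
}

public static class LocalHoisting
{
    // Computes output[i] = input1[i] + (ushort)(b * input2[i]) for every element.
    public static void BlendWithLocals(Holder holder, float b)
    {
        // Call each accessor once, outside the loop.
        ushort[] outArr = holder.GetOutput();
        ushort[] in1 = holder.GetInput1();
        ushort[] in2 = holder.GetInput2();

        // Looping to outArr.Length (rather than a separate length variable)
        // may also help the JIT elide the bounds check on outArr.
        for (int i = 0; i < outArr.Length; i++)
        {
            outArr[i] = (ushort)(in1[i] + (ushort)(b * in2[i]));
        }
    }
}
```

With the references held in locals, the loop body is the same as the "arrays added directly" case, so any remaining difference should come from somewhere other than the accessors.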

Listing 1, C# code:

public class ArrayHolder
{
    int length;
    public ushort[] output;
    public ushort[] input1;
    public ushort[] input2;
    public ArrayHolder(int inLength)
    {
        length = inLength;
        output = new ushort[length];
        input1 = new ushort[length];
        input2 = new ushort[length];
    }

    public ushort[] getOutput() { return output; }
    public ushort[] getInput1() { return input1; }
    public ushort[] getInput2() { return input2; }
}


/// <summary>
/// Interaction logic for MainWindow.xaml
/// </summary>
public partial class MainWindow : Window
{
    public MainWindow()
    {
        InitializeComponent();


        Random random = new Random();

        int length = 55 * 1024 * 1024;
        ushort[] output = new ushort[length];
        ushort[] input1 = new ushort[length];
        ushort[] input2 = new ushort[length];

        ArrayHolder theArrayHolder = new ArrayHolder(length);

        for (int i = 0; i < length; i++)
        {
            output[i] = (ushort)random.Next(0, 16384);
            input1[i] = (ushort)random.Next(0, 16384);
            input2[i] = (ushort)random.Next(0, 16384);
            theArrayHolder.getOutput()[i] = output[i];
            theArrayHolder.getInput1()[i] = input1[i];
            theArrayHolder.getInput2()[i] = input2[i];
        }

        Stopwatch stopwatch = new Stopwatch(); 
        stopwatch.Start();
        int number = 44;
        float b = (float)number / 100.0f;
        for (int i = 0; i < length; i++)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * (float)input2[i]));
        } 
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.output[i] =
                (ushort)(theArrayHolder.input1[i] +
                (ushort)(b * (float)theArrayHolder.input2[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        for (int i = 0; i < length; i++)
        {
            theArrayHolder.getOutput()[i] =
                (ushort)(theArrayHolder.getInput1()[i] +
                (ushort)(b * (float)theArrayHolder.getInput2()[i]));
        }
        stopwatch.Stop();

        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
    }
}

Listing 2, C++ equivalent:

// looptiming.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"
#include <stdlib.h>
#include <windows.h>
#include <stdio.h>
#include <iostream>


int _tmain(int argc, _TCHAR* argv[])
{

    int length = 55*1024*1024;
    unsigned short* output = new unsigned short[length];
    unsigned short* input1 = new unsigned short[length];
    unsigned short* input2 = new unsigned short[length];
    unsigned short* outPtr = output;
    unsigned short* in1Ptr = input1;
    unsigned short* in2Ptr = input2;
    int i;
    const int max = 16384;
    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = rand()%max;
        *in1Ptr = rand()%max;
        *in2Ptr = rand()%max;
    }

    LARGE_INTEGER ticksPerSecond;
    LARGE_INTEGER tick1, tick2;   // A point in time
    LARGE_INTEGER time;   // For converting tick into real time


    QueryPerformanceCounter(&tick1);

    outPtr = output;
    in1Ptr = input1;
    in2Ptr = input2;
    int number = 44;
    float b = (float)number/100.0f;


    for (i = 0; i < length; ++i, ++outPtr, ++in1Ptr, ++in2Ptr){
        *outPtr = *in1Ptr + (unsigned short)((float)*in2Ptr * b);
    }
    QueryPerformanceCounter(&tick2);
    QueryPerformanceFrequency(&ticksPerSecond);

    time.QuadPart = tick2.QuadPart - tick1.QuadPart;

    std::cout << "Run time for pointer walk: " << (double)time.QuadPart/(double)ticksPerSecond.QuadPart << std::endl;

    return 0;
}

EDIT 2: Enabling /QxHost in the second example drops the time down to 0.0662714 seconds. Modifying the first loop as @Reed suggested gets me down to

Time elapsed: 00:00:00.3835017

So, still not fast enough for a slider. That time comes from this code:

        stopwatch.Start();
        Parallel.ForEach(Partitioner.Create(0, length),
         (range) =>
         {
             for (int i = range.Item1; i < range.Item2; i++)
             {
                 output[i] =
                     (ushort)(input1[i] +
                     (ushort)(b * (float)input2[i]));
             }
         });

        stopwatch.Stop();

EDIT 3: As per @Eric Lippert's suggestion, I've rerun the C# code in a release build and, rather than running under an attached debugger, just printed the results to a dialog. They are:

  • Simple arrays: ~0.273s
  • Contained arrays: ~0.330s
  • Accessor arrays: ~0.345s
  • Parallel arrays: ~0.190s

(These numbers are averages over 5 runs.)
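Averaging over several runs, as done for the figures above, can be wrapped in a small helper. This is a minimal sketch: `Timing` and `AverageSeconds` are hypothetical names, and the untimed warm-up call is an assumption of this sketch (so that JIT compilation is not counted), not something the measurements above are known to have done.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs `action` once untimed as a warm-up (an assumption of this sketch),
    // then returns the mean wall-clock time in seconds over `runs` timed calls.
    public static double AverageSeconds(Action action, int runs)
    {
        if (action == null) throw new ArgumentNullException("action");
        if (runs <= 0) throw new ArgumentOutOfRangeException("runs");

        action(); // warm-up, not timed

        double total = 0.0;
        for (int r = 0; r < runs; r++)
        {
            Stopwatch sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            total += sw.Elapsed.TotalSeconds;
        }
        return total / runs;
    }
}
```

Each of the loops in the listings can be passed as a lambda, for example `Timing.AverageSeconds(() => { /* loop here */ }, 5)`, ideally from a release build started without the debugger attached.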

So the parallel solution is definitely faster than the 3.5 seconds I was getting before, but it is still well short of the 0.074 seconds achieved by the icc-compiled C++ version. It seems, therefore, that the fastest solution is to compile in release and then marshal to an icc-compiled C++ executable, which makes using a slider possible here.

EDIT 4: Three more suggestions from @Eric Lippert: change the bound of the for loop from length to the array's Length property, use doubles, and try unsafe code.

For those three, the timings are now:

  • length: ~0.274s
  • doubles, not floats: ~0.290s
  • unsafe: ~0.376s

So far, the parallel solution is the big winner. Although if I could do this addition in a shader, maybe I could see some kind of speedup there...

Here's the additional code:

        stopwatch.Reset();

        stopwatch.Start();

        double b2 = ((double)number) / 100.0;
        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b2 * (double)input2[i]));
        }

        stopwatch.Stop();
        // TotalSeconds avoids misprinting e.g. 0 s 5 ms as "0.5".
        DoubleArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        stopwatch.Reset();

        stopwatch.Start();

        for (int i = 0; i < output.Length; ++i)
        {
            output[i] =
                (ushort)(input1[i] +
                (ushort)(b * input2[i]));
        }

        stopwatch.Stop();
        LengthArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);
        stopwatch.Reset();

        stopwatch.Start();
        unsafe
        {
            fixed (ushort* outPtr = output, in1Ptr = input1, in2Ptr = input2){
                ushort* outP = outPtr;
                ushort* in1P = in1Ptr;
                ushort* in2P = in2Ptr;
                for (int i = 0; i < output.Length; ++i, ++outP, ++in1P, ++in2P)
                {
                    *outP = (ushort)(*in1P + b * (float)*in2P);
                }
            }
        }

        stopwatch.Stop();
        UnsafeArrayLabel.Content += "\t" + stopwatch.Elapsed.TotalSeconds.ToString("F3");
        Console.WriteLine("Time elapsed: {0}",
            stopwatch.Elapsed);

This should be perfectly parallelizable. However, given the small amount of work being done per element, you'll need to handle this with extra care.

The proper way to do this (in .NET 4) would be to use Parallel.ForEach in conjunction with a Partitioner:

float b = (float)number / 100.0f;
Parallel.ForEach(Partitioner.Create(0, length), 
(range) =>
{
   for (int i = range.Item1; i < range.Item2; i++)
   {
      image.DataArray[i] = 
          (ushort)(mUIHandler.image1.DataArray[i] + 
          (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
   }
});

This will efficiently partition the work across available processing cores in your system, and should provide a decent speedup if you have multiple cores.
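Because each element needs so little work, the size of each chunk matters. As a rough sketch of how the chunk size could be made explicit, the version below uses the `Partitioner.Create(fromInclusive, toExclusive, rangeSize)` overload. `ChunkedBlend`, `Blend`, and the plain array parameters are hypothetical names standing in for the `image` and `mUIHandler` fields above, and a good `rangeSize` is something to find by measuring, not a known value.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class ChunkedBlend
{
    // Computes output[i] = in1[i] + (ushort)(b * in2[i]) in parallel,
    // handing each worker contiguous ranges of at most rangeSize elements.
    public static void Blend(ushort[] in1, ushort[] in2, ushort[] output,
                             float b, int rangeSize)
    {
        if (rangeSize <= 0) throw new ArgumentOutOfRangeException("rangeSize");
        // Partitioner.Create rejects an empty range, so return early.
        if (output.Length == 0) return;

        Parallel.ForEach(Partitioner.Create(0, output.Length, rangeSize), range =>
        {
            // range.Item1 is inclusive, range.Item2 is exclusive.
            for (int i = range.Item1; i < range.Item2; i++)
            {
                output[i] = (ushort)(in1[i] + (ushort)(b * in2[i]));
            }
        });
    }
}
```

Each delegate call then covers a whole range, so the per-call overhead is spread over many elements instead of being paid once per element.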

That being said, this will, at best, only speed up the operation by a factor equal to the number of cores in your system. If you need to speed it up more, you'll likely need to resort to a mix of parallelization and unsafe code. At that point, it might be worth thinking about alternatives to trying to present this in real time.
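A mix of the two could look roughly like the sketch below: the partitioner hands out ranges, and each range is processed with pinned pointers. This is an illustration only. `UnsafeParallelBlend`, `Blend`, and `BlendRange` are hypothetical names, the code must be compiled with unsafe blocks enabled (`/unsafe` or `AllowUnsafeBlocks`), and the question's own measurements found the unsafe loop slower than the safe one, so any benefit needs to be measured.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

public static class UnsafeParallelBlend
{
    // Computes output[i] = in1[i] + (ushort)(b * in2[i]) over parallel ranges.
    public static void Blend(ushort[] in1, ushort[] in2, ushort[] output, float b)
    {
        // Pointers bypass bounds checks, so validate the lengths up front.
        if (in1.Length < output.Length || in2.Length < output.Length)
            throw new ArgumentException("Input arrays are shorter than the output array.");
        if (output.Length == 0) return; // Partitioner.Create rejects an empty range.

        Parallel.ForEach(Partitioner.Create(0, output.Length), range =>
        {
            BlendRange(in1, in2, output, b, range.Item1, range.Item2);
        });
    }

    // Processes elements [from, to) through raw pointers.
    private static unsafe void BlendRange(ushort[] in1, ushort[] in2, ushort[] output,
                                          float b, int from, int to)
    {
        // Pin the arrays so the garbage collector cannot move them during the loop.
        fixed (ushort* pOut = output, pIn1 = in1, pIn2 = in2)
        {
            ushort* o = pOut + from;
            ushort* a = pIn1 + from;
            ushort* c = pIn2 + from;
            ushort* end = pOut + to;
            while (o < end)
            {
                *o = (ushort)(*a + (ushort)(b * (*c)));
                o++; a++; c++;
            }
        }
    }
}
```

Keeping the pointer code in its own method leaves the lambda free of unsafe constructs and pins the arrays once per range, not once per element.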

Assuming you have a lot of these elements and you're using .NET 4, you can attempt to parallelize the operation:

Parallel.For(0, length, i=>
   {
       image.DataArray[i] = 
      (ushort)(mUIHandler.image1.DataArray[i] + 
      (ushort)(b * (float)mUIHandler.image2.DataArray[i]));
   });

Of course, that all depends on whether parallelizing this is worth the overhead. The statement looks computationally cheap, and indexing into an array is already fast. You might still see gains because the loop runs so many times over that much data.
