
Effective Thread Synchronization in C#

I have a scenario where I need to search many binary files (using keys) and combine the results (strings). Until now, I have been doing it in a foreach loop, one file after the other.

foreach (string file in FileSources.Keys)
{
    aggregatedDefinitions.Append(DefinitionLookup(txtSearchWord.Text, file));
}

Since this operation is very slow, I was thinking of using threads so that I could do the IO operations in parallel. Is threading the right way to go? If I use threading, how can I ensure that I get the results in the order I want?

I haven't used threading until now. It would be very helpful if you could suggest some materials/books that would help me solve my problem.

Generally speaking, it's advised to use threading for I/O operations when you're performing one I/O operation on a separate thread from an application's main (typically GUI) thread. Splitting many I/O operations off onto separate threads in parallel probably isn't going to help you, since the disk can only be accessed by one thread at a time.

Taking into account the concerns voiced by others about attempting parallel I/Os against a single disk device, it does look like your processing model can be broken up into parts. You have a FileSources.Keys input list, and the output appears to be simply appending the compute results to aggregatedDefinitions.

Here's how you can break that up for processing on multiple threads and preserve the order of your current results:

First, decide how many threads you are going to use. For compute intensive tasks, there is usually no point in spinning up more threads than you have CPU cores. For I/O bound tasks, you can use more threads than CPU cores since the threads will be spending most of their time waiting for I/O completion.

Let's assume your DefinitionLookup is compute intensive, not I/O intensive, and let's assume you're running on a dual core CPU. In these conditions, two threads would be a good choice.

Next, break the input up into largish chunks, preserving the order of the inputs. For our two thread scenario, send the first half of the FileSources.Keys list to the first thread, and the second half to the second thread.

In each thread, process the inputs as before, but append the output to a local list object, not the final (shared) aggregatedDefinitions list.

After the threads have finished their processing, have the main thread concatenate each thread's list results into the final aggregatedDefinitions list, in the correct order. (Thread 1, which received the first half of the inputs, produces list1, which should be appended to the master list before Thread 2's results.)

Something like this:

    static void Mainthread()
    {
        List<string> input = new List<string>();  // fill with data

        int half = input.Count / 2;
        ManualResetEvent event1 = new ManualResetEvent(false);
        List<string> results1 = null;

        // give the first half of the input to the first thread
        ThreadPool.QueueUserWorkItem(r => ComputeTask(input.GetRange(0, half), out results1, event1));

        ManualResetEvent event2 = new ManualResetEvent(false);
        List<string> results2 = null;

        // give the second half of the input to the second thread
        // (the range starts at index 'half', not 'half + 1', otherwise the
        // element at index 'half' would be skipped and the count would overrun)
        ThreadPool.QueueUserWorkItem(r => ComputeTask(input.GetRange(half, input.Count - half), out results2, event2));

        // wait for both tasks to complete
        WaitHandle.WaitAll(new WaitHandle[] { event1, event2 });

        // combine the results, preserving order
        List<string> finalResults = new List<string>();
        finalResults.AddRange(results1);
        finalResults.AddRange(results2);
    }

    static void ComputeTask(List<string> input, out List<string> output, ManualResetEvent signal)
    {
        output = new List<string>();
        foreach (var item in input)
        {
            // do work here
            output.Add(item);
        }

        signal.Set();
    }

Also, even if all the I/O activity is hitting one disk drive, you could get some performance benefit from asynchronous file reads. The idea is that you issue the next file read request as soon as you receive the data from the previous one, process that data, then wait for the completion of the next read. This lets you use the CPU for processing while the disk I/O request is being handled, without explicitly using threads yourself.

Compare these (pseudo) execution timelines for reading and processing 4 chunks of data. Assume a file read takes about 500 time units to complete, and processing that data takes about 10 time units.

Synchronous file I/O:  
read (500)
process data (10)
read (500)
process data (10)
read (500)
process data (10)
read (500)
process data (10)
Total time: 2040 time units

Async file I/O:
begin async read 1
async read 1 completed (500)
begin async read 2 / process data 1 (10)
async read 2 completed (500)
begin async read 3 / process data 2 (10)
async read 3 completed (500)
begin async read 4 / process data 3 (10)
async read 4 completed (500)
process data 4 (10)
Total time: 2010 time units

The processing of data 1, 2 and 3 happens while the next read request is pending, so compared to the first execution timeline, you get that processing time essentially for free. The processing of the last data chunk adds to the total time because there is no read operation for it to run concurrently with.

The scale of these operations (500 for I/O, 10 for compute) is conservative; real I/O times tend to be many orders of magnitude larger than compute times. As you can see, when the compute operation is pretty quick, you don't get a lot of performance benefit out of all this work.

You can get greater value out of the effort of doing async I/O if what you're doing in the "free" time is substantial. Cryptography or image processing, for example, would likely be a win, but string concatenation probably would not be worth it. Writing data to another file could be worthwhile in the async overlap, but as others have noted, the benefits will be diminished if all I/O is on the same physical device.
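The overlapped-read idea described above can be sketched with Task-based async reads. This is only a sketch: `File.ReadAllBytesAsync` requires .NET Core / .NET 5+ (on .NET Framework you would use `FileStream.BeginRead` or `FileStream.ReadAsync` instead), and `ProcessData` here is a hypothetical stand-in for whatever per-file work you do (e.g. your DefinitionLookup):

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;

class OverlappedReader
{
    // Hypothetical stand-in for the real per-file work.
    static string ProcessData(byte[] data) => data.Length.ToString();

    // Overlap reading file i+1 with processing file i.
    public static async Task<string> SearchFilesAsync(string[] files)
    {
        var results = new StringBuilder();
        Task<byte[]> pending = File.ReadAllBytesAsync(files[0]);
        for (int i = 0; i < files.Length; i++)
        {
            byte[] data = await pending;                        // finish the current read
            if (i + 1 < files.Length)
                pending = File.ReadAllBytesAsync(files[i + 1]); // start the next read
            results.Append(ProcessData(data));                  // runs while the next read is in flight
        }
        return results.ToString();
    }
}
```

Because the results are appended in loop order, the output order matches the input order even though reads and processing overlap.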

I second the opinions of Dan and Fredrik, and would add that an attempt to multithread IO against a single disk can make things worse instead of improving performance.

Access requests from parallel threads can increase disk thrashing, which will make data retrieval from the disk slower than it is now.

If you are using .NET 4.0, you might want to look into the Parallel Extensions and the Parallel class. I've written some examples of how to use them in .NET 4.0 with C#.
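For instance, the per-file lookups could be pushed through PLINQ, which ships with .NET 4.0. A minimal sketch, where `DefinitionLookup` is a hypothetical stand-in for the asker's real lookup; the key point is that `AsOrdered()` preserves the original input order even though the lookups run on multiple threads:

```csharp
using System.Linq;
using System.Text;

class ParallelLookup
{
    // Hypothetical stand-in for the real DefinitionLookup(searchWord, file).
    static string DefinitionLookup(string word, string file) => file.ToUpper();

    public static string SearchAll(string word, string[] files)
    {
        // AsOrdered() keeps the output in the original file order
        // even though the Select runs in parallel.
        var parts = files.AsParallel()
                         .AsOrdered()
                         .Select(f => DefinitionLookup(word, f));

        var aggregated = new StringBuilder();
        foreach (var part in parts)
            aggregated.Append(part);
        return aggregated.ToString();
    }
}
```

Whether this actually helps depends on the earlier caveat: if the work is dominated by reads from a single disk, parallelizing it may not buy much.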

You might also want to look into parallel IO in F# (read Don Syme's weblog). The parts you have that need IO parallelized, you might want to write in F#.

Check memory-mapped files in .NET 4.0; if you are using C# 3.5, check the P/Invoke implementations for the topic. It really speeds up the IO operations and general performance of your application. I have an application which calculates MD5 on a given folder to find duplicates, and it uses memory-mapped files for file access. If you need the sample source code and P/Invoked memory mapping libraries, contact me.

http://en.wikipedia.org/wiki/Memory-mapped_file or check the implementation here: http://www.pinvoke.net/default.aspx/kernel32.createfilemapping

It will really speed up your IO operations without additional threading overhead.
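On .NET 4.0 no P/Invoke is needed for this; a minimal sketch of reading a slice of a file through the managed `MemoryMappedFile` API (the method name `ReadSlice` is just for illustration):

```csharp
using System.IO;
using System.IO.MemoryMappedFiles;

class MappedRead
{
    // Read `count` bytes starting at `offset` through a memory-mapped view.
    public static byte[] ReadSlice(string path, long offset, int count)
    {
        using (var mmf = MemoryMappedFile.CreateFromFile(path, FileMode.Open))
        using (var view = mmf.CreateViewAccessor(offset, count))
        {
            var buffer = new byte[count];
            view.ReadArray(0, buffer, 0, count);  // position 0 is relative to the view's offset
            return buffer;
        }
    }
}
```

In a real lookup you would keep the mapping open across many reads rather than remapping per call; the OS page cache then serves repeated accesses without explicit read system calls.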
