
What's a good strategy for processing a queue in parallel?

Original 2011-05-08 06:13:49 6 3 c#/ queue/ deadlock/ parallel-processing

I'm writing a program that needs to recursively search a folder structure, and I would like to do so in parallel with several threads.

I've already written the rather trivial synchronous method: add the root directory to the queue, then dequeue a directory, queue its subdirectories, and so on until the queue is empty. I'll use a ConcurrentQueue<T> for my queue, but I have already realized that my loops will stop prematurely. The first thread will dequeue the root directory, and every other thread could immediately see that the queue is empty and exit, leaving the first thread as the only one running. I would like each thread to loop until the queue is empty, then wait until another thread queues more directories, and keep going. I need some sort of checkpoint in my loop so that no thread exits until every thread has reached the end of the loop, but I'm not sure of the best way to do this without deadlocking when there really are no more directories to process.

3 answers

Use the Task Parallel Library.

Create a Task to process the first folder. Within it, create a Task to process each subfolder (recursively) and a Task for each relevant file. Then wait on all the tasks for this folder.

The TPL runtime will use the thread pool rather than creating threads, which is an expensive operation for small pieces of work.
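The recursive approach described above could be sketched roughly as follows. This is a minimal illustration, not the answerer's code: the names `TplDirectoryWalk`, `ProcessFolder`, and `found` are hypothetical, and collecting file paths stands in for whatever per-file work the program really does. Files are handled inline here on the assumption that the per-file work is trivial.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
static class TplDirectoryWalk
{
    // Processes one folder: handles its files inline, starts one Task per
    // subfolder, then waits for those tasks before returning.
    public static void ProcessFolder(string path, ConcurrentBag<string> found)
    {
        foreach (string file in Directory.EnumerateFiles(path))
        {
            found.Add(file); // placeholder for the real per-file work
        }

        // One task per subfolder; each task recurses into its own subfolders.
        Task[] subfolderTasks = Directory.EnumerateDirectories(path)
            .Select(sub => Task.Factory.StartNew(() => ProcessFolder(sub, found)))
            .ToArray();

        // Returns immediately when there are no subfolders, so the recursion
        // ends naturally at the leaves; no shared queue or checkpoint is needed.
        Task.WaitAll(subfolderTasks);
    }
}
```

Calling `TplDirectoryWalk.ProcessFolder(rootPath, new ConcurrentBag<string>())` returns only after the whole tree has been visited. An exception inside a subfolder task (for example, an access-denied error) surfaces from `Task.WaitAll` as an `AggregateException`, which a real program would probably want to handle.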

Note:

If the work per file is trivial, do it inline rather than creating another task (IO performance will be the limiting factor).
This approach will generally work best if blocking operations are avoided, but if IO performance is the limit then this might not matter anyway. Start simple and measure.
Before .NET 4 much of this can be done with the thread pool, but you'll need to use events to wait for tasks to complete, and that waiting will tie up thread pool threads. ¹

¹ As I understand it, when you wait on tasks using a TPL method, the TPL will reuse that thread for other tasks until the wait is fulfilled.

If you want to stick to the concept of an explicit queue, have a look at the BlockingCollection class. The method GetConsumingEnumerable() returns an IEnumerable that blocks when the collection has run out of items and continues as soon as new items are available. This means that whenever the collection is empty the thread is blocked, which prevents it from stopping prematurely.

However: Basically this is very useful for producer-consumer scenarios. I am not sure if your problem falls into this category.
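One way the explicit-queue idea might be combined with a termination check is sketched below. The `pending` counter is an assumption added for this illustration and is not part of the answer above: it counts directories that are queued or still being processed, and the worker that brings it to zero calls `CompleteAdding()`, which ends every worker's `GetConsumingEnumerable()` loop instead of leaving the workers blocked forever. The names `QueueDirectoryWalk`, `Walk`, and `workerCount` are hypothetical.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper, for illustration only.
static class QueueDirectoryWalk
{
    public static ConcurrentBag<string> Walk(string root, int workerCount)
    {
        var found = new ConcurrentBag<string>();
        using (var queue = new BlockingCollection<string>())
        {
            // Assumption of this sketch: directories queued or in progress.
            // The root counts as one.
            int pending = 1;
            queue.Add(root);

            Action worker = () =>
            {
                // Blocks while the queue is empty; the loop ends once
                // CompleteAdding() has been called and the queue is drained.
                foreach (string dir in queue.GetConsumingEnumerable())
                {
                    try
                    {
                        foreach (string sub in Directory.EnumerateDirectories(dir))
                        {
                            Interlocked.Increment(ref pending); // count before queuing
                            queue.Add(sub);
                        }
                        foreach (string file in Directory.EnumerateFiles(dir))
                        {
                            found.Add(file); // placeholder for the real work
                        }
                    }
                    finally
                    {
                        // This directory is done. If nothing else is queued or
                        // in progress, no more work can ever arrive.
                        if (Interlocked.Decrement(ref pending) == 0)
                        {
                            queue.CompleteAdding();
                        }
                    }
                }
            };

            // LongRunning hints that these workers block and should not
            // occupy ordinary thread pool threads.
            Task[] workers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Factory.StartNew(worker, TaskCreationOptions.LongRunning))
                .ToArray();
            Task.WaitAll(workers);
        }
        return found;
    }
}
```

Because a directory's own count is held while its subdirectories are being added, `pending` cannot reach zero while an `Add` is still possible, so `CompleteAdding()` is never followed by another `Add`. If a worker hits an exception, the `finally` block still releases the count, and the error surfaces from `Task.WaitAll` as an `AggregateException`.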

It would seem that in this case your best bet would be to create one thread to start, then, whenever you load subdirectories, use threads from the thread pool to handle them. Allow your threads to exit when they are done, and call new ones from the pool every time you go one step further into the directories. This way there is no deadlock and your system uses threads as it needs them. You could even specify how many threads to start based on how many folders were found.

Edit: Changed the above to make it clearer that you don't want to explicitly create new threads; instead, you want to take advantage of the thread pool to add and remove threads as needed without the overhead.
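A rough sketch of this thread pool idea follows. The answer does not say how the caller learns that the whole search has finished, so the `CountdownEvent` used here is an assumption of this sketch, and the names `ThreadPoolDirectoryWalk`, `Walk`, and `visit` are hypothetical. Each directory becomes one work item; the work item exits when it is done, and the event reaches zero once every queued directory has been processed.

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Threading;

// Hypothetical helper, for illustration only.
static class ThreadPoolDirectoryWalk
{
    public static ConcurrentBag<string> Walk(string root)
    {
        var found = new ConcurrentBag<string>();

        // Counts directories that are queued or in progress; 1 = the root.
        using (var outstanding = new CountdownEvent(1))
        {
            WaitCallback visit = null;
            visit = state =>
            {
                string dir = (string)state;
                try
                {
                    foreach (string sub in Directory.EnumerateDirectories(dir))
                    {
                        outstanding.AddCount();                  // register before queuing
                        ThreadPool.QueueUserWorkItem(visit, sub); // pool picks the thread
                    }
                    foreach (string file in Directory.EnumerateFiles(dir))
                    {
                        found.Add(file); // placeholder for the real work
                    }
                }
                catch (UnauthorizedAccessException) { /* skip unreadable folders */ }
                catch (IOException) { /* skip folders that vanished or failed */ }
                finally
                {
                    outstanding.Signal(); // this directory is finished
                }
            };

            ThreadPool.QueueUserWorkItem(visit, root);
            outstanding.Wait(); // returns once every directory has signalled
        }
        return found;
    }
}
```

The exceptions are caught inside the work item because an unhandled exception on a thread pool thread would normally take down the process. Only the calling thread blocks in `Wait()`; the pool threads never wait on each other, which is why no deadlock arises when the directories run out.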