Why is Parallel.ForEach much faster than AsParallel().ForAll() even though MSDN suggests otherwise?

Question

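I've been investigating how we can create a multithreaded application that walks through a tree.

To find out how this can best be implemented, I created a test application that runs through my C:\ disk and opens all directories.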

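using System;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class Program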
{
    static void Main(string[] args)
    {
        //var startDirectory = @"C:\The folder\RecursiveFolder";
        var startDirectory = @"C:\";

        var w = Stopwatch.StartNew();

        ThisIsARecursiveFunction(startDirectory);

        Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);

        Console.ReadKey();
    }

    public static void ThisIsARecursiveFunction(String currentDirectory)
    {
        var lastBit = Path.GetFileName(currentDirectory);
        var depth = currentDirectory.Count(t => t == '\\');
        //Console.WriteLine(depth + ": " + currentDirectory);

        try
        {
            var children = Directory.GetDirectories(currentDirectory);

            //Edit this mode to switch what way of parallelization it should use
            int mode = 3;

            switch (mode)
            {
                case 1:
                    foreach (var child in children)
                    {
                        ThisIsARecursiveFunction(child);
                    }
                    break;
                case 2:
                    children.AsParallel().ForAll(t =>
                    {
                        ThisIsARecursiveFunction(t);
                    });
                    break;
                case 3:
                    Parallel.ForEach(children, t =>
                    {
                        ThisIsARecursiveFunction(t);
                    });
                    break;
                default:
                    break;
            }

        }
        catch (Exception eee)
        {
            //Exception might occur for directories that can't be accessed.
        }
    }
}

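However, what I found is that when running in mode 3 (Parallel.ForEach) the code completes in around 2.5 seconds (yes, I have an SSD ;)). Running the code without parallelization, it completes in around 8 seconds. And running the code in mode 2 (AsParallel().ForAll()) takes a near-infinite amount of time.

When checking in Process Explorer I also noticed a few strange facts: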

Mode1 (No Parallelization):
Cpu:     ~25%
Threads: 3
Time to complete: ~8 seconds

Mode2 (AsParallel().ForAll()):
Cpu:     ~0%
Threads: Increasing by one per second (I find this strange since it seems to be waiting on the other threads to complete or a second timeout.)
Time to complete: 1 second per node so about 3 days???

Mode3 (Parallel.ForEach()):
Cpu:     100%
Threads: At most 29-30
Time to complete: ~2.5 seconds

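What I find especially strange is that Parallel.ForEach seems to ignore any parent threads/tasks that are still running, while AsParallel().ForAll() seems to wait for the previous task to complete (which won't happen soon, since all parent tasks are still waiting on their child tasks to complete).

Also, what I read on MSDN was: "Prefer ForAll to ForEach When It Is Possible"

Source: http://msdn.microsoft.com/en-us/library/dd997403(v=vs.110).aspx

Does anyone have a clue why this could be?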

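Edit 1:

As requested by Matthew Watson, I now load the tree into memory first and then loop through it. The loading of the tree is done sequentially.

The results, however, are the same. Unparallelized and Parallel.ForEach now complete the whole tree in about 0.05 seconds, while AsParallel().ForAll() still only does around 1 step per second.

Code: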

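using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;
using System.Threading.Tasks;

class Program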
{
    private static DirWithSubDirs RootDir;

    static void Main(string[] args)
    {
        //var startDirectory = @"C:\The folder\RecursiveFolder";
        var startDirectory = @"C:\";

        Console.WriteLine("Loading file system into memory...");
        RootDir = new DirWithSubDirs(startDirectory);
        Console.WriteLine("Done");


        var w = Stopwatch.StartNew();

        ThisIsARecursiveFunctionInMemory(RootDir);

        Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);

        Console.ReadKey();
    }        

    public static void ThisIsARecursiveFunctionInMemory(DirWithSubDirs currentDirectory)
    {
        var depth = currentDirectory.Path.Count(t => t == '\\');
        Console.WriteLine(depth + ": " + currentDirectory.Path);

        var children = currentDirectory.SubDirs;

        //Edit this mode to switch what way of parallelization it should use
        int mode = 2;

        switch (mode)
        {
            case 1:
                foreach (var child in children)
                {
                    ThisIsARecursiveFunctionInMemory(child);
                }
                break;
            case 2:
                children.AsParallel().ForAll(t =>
                {
                    ThisIsARecursiveFunctionInMemory(t);
                });
                break;
            case 3:
                Parallel.ForEach(children, t =>
                {
                    ThisIsARecursiveFunctionInMemory(t);
                });
                break;
            default:
                break;
        }
    }
}

class DirWithSubDirs
{
    public List<DirWithSubDirs> SubDirs = new List<DirWithSubDirs>();
    public String Path { get; private set; }

    public DirWithSubDirs(String path)
    {
        this.Path = path;
        try
        {
            SubDirs = Directory.GetDirectories(path).Select(t => new DirWithSubDirs(t)).ToList();
        }
        catch (Exception eee)
        {
            //Ignore directories that can't be accessed
        }
    }
}

Edit 2:

After reading the update to Matthew's comment, I tried adding the following code to the program:

ThreadPool.SetMinThreads(4000, 16);
ThreadPool.SetMaxThreads(4000, 16);

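This, however, did not change how AsParallel().ForAll() performed. The first 8 steps are still executed in an instant before it slows down to 1 step per second.

(Extra note: I'm currently ignoring the exceptions that occur when I can't access a directory, via the try/catch block around Directory.GetDirectories().)

Edit 3:

Also, what I'm mainly interested in is the difference between Parallel.ForEach and AsParallel().ForAll(), because to me it's just strange that for some reason the second one creates one thread for every recursion it does, while the first one handles everything in around 30 threads at most. (And also why MSDN suggests using AsParallel even though it creates so many threads with a timeout of about 1 second.)

Edit 4:

Another strange thing I found out: when I try to set MinThreads on the thread pool above 1023, it seems to ignore the value and scale back to around 8 or 16. With ThreadPool.SetMinThreads(1023, 16); the value is accepted.

Still, when I use 1023 it does the first 1023 elements very fast, and then goes back to the slow pace I've been experiencing all along.

Note: literally more than 1000 threads are now created (compared to 30 for the whole Parallel.ForEach run).

Does this mean Parallel.ForEach is just way smarter in handling tasks?

Some more info: this code prints 8 - 8 twice when you set the value above 1023. (When you set the values to 1023 or lower it prints the correct value.)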

        int threadsMin;
        int completionMin;
        ThreadPool.GetMinThreads(out threadsMin, out completionMin);
        Console.WriteLine("Cur min threads: " + threadsMin + " and the other thing: " + completionMin);

        ThreadPool.SetMinThreads(1023, 16);
        ThreadPool.SetMaxThreads(1023, 16);

        ThreadPool.GetMinThreads(out threadsMin, out completionMin);
        Console.WriteLine("Now min threads: " + threadsMin + " and the other thing: " + completionMin);

Edit 5:

As per Dean's request, I created another case that creates the tasks manually:

case 4:
    var taskList = new List<Task>();
    foreach (var todo in children)
    {
        var itemTodo = todo;
        taskList.Add(Task.Run(() => ThisIsARecursiveFunctionInMemory(itemTodo)));
    }
    Task.WaitAll(taskList.ToArray());
    break;

This is also as fast as the Parallel.ForEach() loop. So we still don't have an answer to why AsParallel().ForAll() is so much slower.

Answer 1

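This problem is pretty debuggable, an uncommon luxury when you have problems with threads. Your basic tool here is the Debug > Windows > Threads debugger window. It shows you the active threads and gives you a peek at their stack traces. You'll easily see that, once it gets slow, you have dozens of active threads that are all stuck. Their stack traces all look the same: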

    mscorlib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout, bool exitContext) + 0x16 bytes  
    mscorlib.dll!System.Threading.Monitor.Wait(object obj, int millisecondsTimeout) + 0x7 bytes 
    mscorlib.dll!System.Threading.ManualResetEventSlim.Wait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) + 0x182 bytes    
    mscorlib.dll!System.Threading.Tasks.Task.SpinThenBlockingWait(int millisecondsTimeout, System.Threading.CancellationToken cancellationToken) + 0x93 bytes   
    mscorlib.dll!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler scheduler, bool waitForCompletion) + 0xba bytes  
    mscorlib.dll!System.Threading.Tasks.Task.RunSynchronously(System.Threading.Tasks.TaskScheduler scheduler) + 0x13 bytes  
    System.Core.dll!System.Linq.Parallel.SpoolingTask.SpoolForAll<ConsoleApplication1.DirWithSubDirs,int>(System.Linq.Parallel.QueryTaskGroupState groupState, System.Linq.Parallel.PartitionedStream<ConsoleApplication1.DirWithSubDirs,int> partitions, System.Threading.Tasks.TaskScheduler taskScheduler) Line 172  C#
// etc..

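Whenever you see something like this, you should immediately think fire-hose problem. It is probably the third most common bug with threads, after races and deadlocks.

Now that you know the cause, you can reason it out: the problem with the code is that every thread that completes adds N more threads, where N is the average number of sub-directories in a directory. In effect, the number of threads grows exponentially, which is always bad. It will only stay under control if N = 1, and that of course never happens on a typical disk.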
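To put rough numbers on that growth, here is a minimal sketch that assumes an idealized tree in which every directory has exactly the same number of sub-directories. The names TreeMath, CountNodes and CountWaitingCalls are made up for illustration and do not come from the question's code.

```csharp
using System;

public static class TreeMath
{
    // Total number of directories in a tree where every directory has
    // `width` sub-directories, down to `depth` levels below the root.
    public static long CountNodes(int width, int depth)
    {
        long total = 0;
        long levelCount = 1; // level 0 holds only the root
        for (int level = 0; level <= depth; level++)
        {
            total += levelCount;
            levelCount *= width; // each level is `width` times larger
        }
        return total;
    }

    // Number of directories that have children, i.e. recursive calls
    // that have to wait for nested work before they can return.
    public static long CountWaitingCalls(int width, int depth)
    {
        if (width == 0 || depth == 0)
            return 0;
        return CountNodes(width, depth - 1);
    }
}
```

For a tree with 4 sub-directories per directory and 4 levels, CountNodes(4, 4) is 341 and CountWaitingCalls(4, 4) is 85. A real disk is far wider and deeper than that, so the number of calls that can end up waiting at the same time quickly dwarfs any thread pool minimum.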

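Do beware that, like almost any threading problem, this misbehavior tends to reproduce poorly. The SSD in your machine tends to hide it. So does the RAM in your machine: the program might well complete quickly and trouble-free the second time you run it, because you are now reading from the file system cache instead of the disk, which is very fast. Tinkering with ThreadPool.SetMinThreads() hides it as well, but it cannot fix it. It never fixes any problem, it only hides them. No matter what happens, the exponential number will always overwhelm the configured minimum number of threads. You can only hope that the program finishes iterating the drive before that happens. Idle hope for a user with a big drive.

The difference between ParallelEnumerable.ForAll() and Parallel.ForEach() is now perhaps also easily explained. You can tell from the stack trace that ForAll() does something naughty: the RunSynchronously() method blocks until all the threads are completed. Blocking is something thread pool threads should not do; it gums up the thread pool and won't allow it to schedule the processor for another job. And it has the effect you observed: the thread pool is quickly overwhelmed with threads that are waiting on the N other threads to complete. Which isn't happening, because those are waiting in the pool and are not getting scheduled since there are already so many threads active.

This is a deadlock scenario, a pretty common one, but the thread pool manager has a workaround for it. It watches the active thread pool threads and steps in when they don't complete in a timely manner. It then allows an extra thread to start, one more than the minimum set by SetMinThreads(). But not more than the maximum set by SetMaxThreads(); having too many active tp threads is risky and likely to trigger OOM. This does solve the deadlock: it gets one of the ForAll() calls to complete. But this happens at a very slow rate, because the thread pool only does this twice a second. You'll run out of patience before it catches up.

Parallel.ForEach() doesn't have this problem; it doesn't block, so it doesn't gum up the pool.

That seems to be the solution, but keep in mind that your program is still fire-hosing the memory of your machine, adding ever more waiting tp threads to the pool. This can crash your program as well. It just isn't as likely, because you have a lot of memory and the thread pool doesn't use much of it to keep track of a request. Some programmers, however, manage to accomplish that as well.

The solution is a very simple one: just don't use threading. It is harmful; there is no concurrency when you have only one disk, and the disk does not like being commandeered by multiple threads. This is especially bad on a spindle drive, where head seeks are very, very slow. An SSD does a lot better, but it still takes an easy 50 microseconds, overhead that you just don't want or need. The ideal number of threads to access a disk that you can't otherwise expect to be well cached is always one.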
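As an illustration of the single-threaded approach, here is a minimal sketch that walks the directory tree on the calling thread with an explicit stack instead of recursion. The SequentialWalker class and its CountDirectories method are hypothetical names, and the sketch assumes the tree contains no junctions or symbolic links that loop back on themselves.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class SequentialWalker
{
    // Visits every accessible directory under `root` on the calling thread
    // and returns how many directories were visited, including the root.
    public static int CountDirectories(string root)
    {
        var pending = new Stack<string>();
        pending.Push(root);
        int visited = 0;

        while (pending.Count > 0)
        {
            string current = pending.Pop();
            visited++;

            string[] children;
            try
            {
                children = Directory.GetDirectories(current);
            }
            catch (UnauthorizedAccessException)
            {
                continue; // protected folder: count it, skip its children
            }
            catch (IOException)
            {
                continue; // e.g. the directory vanished while we were walking
            }

            foreach (string child in children)
                pending.Push(child);
        }

        return visited;
    }
}
```

Calling SequentialWalker.CountDirectories(@"C:\") would behave like mode 1 in the question, with the difference that only the two expected exception types are swallowed rather than every Exception.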

Answer 2

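The first thing to note is that you are trying to parallelise an IO-bound operation, which will distort the timings significantly.

The second thing to note is the nature of the parallelised tasks: you are recursively descending a directory tree. If you create multiple threads to do this, each thread is likely to be accessing a different part of the disk simultaneously, which will cause the disk read head to jump all over the place and slow things down considerably.

Try changing your test to create an in-memory tree, and access that with multiple threads instead. Then you will be able to compare the timings properly without the results being distorted beyond all usefulness.

Additionally, you may be creating a great number of threads, and they will (by default) be thread pool threads. Having a great number of threads will actually slow things down when they exceed the number of processor cores.

Also note that when you exceed the thread pool minimum threads (defined by ThreadPool.GetMinThreads()), the thread pool manager introduces a delay between each new thread pool thread creation. (I think this is around 0.5 seconds per new thread.)

Also, if the number of threads exceeds the value returned by ThreadPool.GetMaxThreads(), the creating thread will block until one of the other threads has exited. I think this is likely to be happening.

You can test this hypothesis by calling ThreadPool.SetMaxThreads() and ThreadPool.SetMinThreads() to increase these values, and see if it makes any difference.

(Finally, note that if you are really trying to recursively descend from C:\, you will almost certainly get an IO exception when it reaches a protected OS folder.)

NOTE: Set the max/min thread pool threads like this: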

ThreadPool.SetMinThreads(4000, 16);
ThreadPool.SetMaxThreads(4000, 16);
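Both setters return a bool, and a rejected request leaves the old limits in place, so it can be worth checking the result rather than assuming the values took effect. The sketch below is one way to do that; the ThreadPoolLimits class and TrySetWorkerThreads method are hypothetical names. It is an assumption, not something confirmed in this thread, that a silently rejected call is what makes a later GetMinThreads() report the old values.

```csharp
using System;
using System.Threading;

public static class ThreadPoolLimits
{
    // Tries to set the worker-thread minimum and maximum while leaving the
    // I/O completion-port limits as they are. Returns false if the runtime
    // rejected either request.
    public static bool TrySetWorkerThreads(int minWorkers, int maxWorkers)
    {
        int currentMinWorkers, minIo;
        int currentMaxWorkers, maxIo;
        ThreadPool.GetMinThreads(out currentMinWorkers, out minIo);
        ThreadPool.GetMaxThreads(out currentMaxWorkers, out maxIo);

        // Raise the maximum first: a minimum larger than the current
        // maximum is rejected and changes nothing.
        bool maxAccepted = ThreadPool.SetMaxThreads(maxWorkers, maxIo);
        bool minAccepted = ThreadPool.SetMinThreads(minWorkers, minIo);
        return maxAccepted && minAccepted;
    }
}
```

With this helper, TrySetWorkerThreads(4000, 4000) reports whether the request was honoured. The order of the two calls matters when raising the limits, which is why the maximum is set before the minimum here.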

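Follow up

I have tried your test code with the thread pool thread counts set as described above, with the following results (not run on the whole of my C:\ drive, but on a smaller subset):

Mode 1 took 06.5 seconds.
Mode 2 took 15.7 seconds.
Mode 3 took 16.4 seconds.

This is in line with my expectations: adding a load of threads to do this actually makes it slower than single-threaded, and the two parallel approaches take roughly the same time.

In case anyone else wants to investigate this, here's some deterministic test code (the OP's code isn't reproducible because we don't know his directory structure).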

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading.Tasks;

namespace Demo
{
    internal class Program
    {
        private static DirWithSubDirs RootDir;

        private static void Main()
        {
            Console.WriteLine("Loading file system into memory...");
            RootDir = new DirWithSubDirs("Root", 4, 4);
            Console.WriteLine("Done");

            //ThreadPool.SetMinThreads(4000, 16);
            //ThreadPool.SetMaxThreads(4000, 16);

            var w = Stopwatch.StartNew();
            ThisIsARecursiveFunctionInMemory(RootDir);

            Console.WriteLine("Elapsed seconds: " + w.Elapsed.TotalSeconds);
            Console.ReadKey();
        }

        public static void ThisIsARecursiveFunctionInMemory(DirWithSubDirs currentDirectory)
        {
            var depth = currentDirectory.Path.Count(t => t == '\\');
            Console.WriteLine(depth + ": " + currentDirectory.Path);

            var children = currentDirectory.SubDirs;

            //Edit this mode to switch what way of parallelization it should use
            int mode = 3;

            switch (mode)
            {
                case 1:
                    foreach (var child in children)
                    {
                        ThisIsARecursiveFunctionInMemory(child);
                    }
                    break;

                case 2:
                    children.AsParallel().ForAll(t =>
                    {
                        ThisIsARecursiveFunctionInMemory(t);
                    });
                    break;

                case 3:
                    Parallel.ForEach(children, t =>
                    {
                        ThisIsARecursiveFunctionInMemory(t);
                    });
                    break;

                default:
                    break;
            }
        }
    }

    internal class DirWithSubDirs
    {
        public List<DirWithSubDirs> SubDirs = new List<DirWithSubDirs>();

        public String Path { get; private set; }

        public DirWithSubDirs(String path, int width, int depth)
        {
            this.Path = path;

            if (depth > 0)
                for (int i = 0; i < width; ++i)
                    SubDirs.Add(new DirWithSubDirs(path + "\\" + i, width, depth - 1));
        }
    }
}

Answer 3

The Parallel.For and .ForEach methods are implemented internally as equivalent to running iterations in Tasks, e.g. a loop like:

Parallel.For(0, N, i => 
{ 
  DoWork(i); 
});

is equivalent to:

var tasks = new List<Task>(N);
for (int i = 0; i < N; i++)
{
    tasks.Add(Task.Factory.StartNew(state => DoWork((int)state), i));
}
Task.WaitAll(tasks.ToArray());

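From the perspective of every iteration potentially running in parallel with every other iteration, this is an OK mental model, but it is not what happens in reality. Parallel, in fact, does not necessarily use one Task per iteration, as that is significantly more overhead than is necessary. Parallel.ForEach tries to use the minimum number of tasks necessary to complete the loop as fast as possible. It spins up tasks as threads become available to process them, and each of those tasks participates in a management scheme (I think it's called chunking): a task asks for multiple iterations to be done, gets them, processes that work, and then goes back for more. The chunk sizes vary based on the number of tasks participating, the load on the machine, and so on.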
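The chunking idea can be made visible with an explicit range partitioner. This is only an illustration of handing out several iterations at a time; it uses fixed-size ranges from Partitioner.Create, whereas the chunk sizes Parallel.ForEach picks by default are dynamic. ChunkingDemo and SumInChunks are made-up names, and count must be greater than zero.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class ChunkingDemo
{
    // Sums the integers 0..count-1 by giving each worker a contiguous
    // range of indices instead of one index at a time.
    public static long SumInChunks(int count, int chunkSize)
    {
        long total = 0;

        // Each `range` is a Tuple<int, int>: Item1 inclusive, Item2 exclusive.
        Parallel.ForEach(Partitioner.Create(0, count, chunkSize), range =>
        {
            long local = 0;
            for (int i = range.Item1; i < range.Item2; i++)
                local += i;

            // One synchronized update per chunk, not one per iteration.
            Interlocked.Add(ref total, local);
        });

        return total;
    }
}
```

SumInChunks(1000, 64) returns 499500, the same as a plain loop, but the delegate is only invoked once per range of up to 64 indices.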

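PLINQ's .AsParallel() has a different implementation, but it "can" still similarly fetch multiple iterations into a temporary store, do the calculations in a thread (but not as a task), and put the query results into a small buffer. (You get something based on ParallelQuery, and the further .Whatever() functions then bind to an alternative set of extension methods that provide parallel implementations.)

So now that we have a small idea of how these two mechanisms work, I will try to answer your original question:

So why is .AsParallel() slower than Parallel.ForEach? The reason stems from the following. Tasks (or their equivalent implementation here) do NOT block on I/O-like calls. They "await" and free up the CPU to do something else. But (quoting the C# in a Nutshell book): "PLINQ cannot perform I/O-bound work without blocking threads". The calls are synchronous. They were written with the intention that you increase the degree of parallelism if (and ONLY if) you are doing things such as downloading web pages per task, which do not hog CPU time.

And the reason why your function calls are exactly analogous to I/O-bound calls is this: one of your threads (call it T) blocks and does nothing until all of its child threads have finished, which can be a slow process here. T itself is not CPU-intensive while it waits for the children to unblock; it is doing nothing but waiting. Hence it is identical to a typical I/O-bound function call.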
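To make the contrast with a blocked parent concrete, here is a minimal sketch of a recursive walk in which the parent awaits its children instead of blocking on them. It is not code from the question: TreeNode is a hypothetical in-memory stand-in for DirWithSubDirs, and AsyncWalker.CountAsync is a made-up name.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

// Hypothetical in-memory tree node, standing in for DirWithSubDirs.
public sealed class TreeNode
{
    public List<TreeNode> Children { get; } = new List<TreeNode>();

    // Builds a tree where every node has `width` children, `depth` levels deep.
    public static TreeNode Build(int width, int depth)
    {
        var node = new TreeNode();
        if (depth > 0)
            for (int i = 0; i < width; i++)
                node.Children.Add(Build(width, depth - 1));
        return node;
    }
}

public static class AsyncWalker
{
    // Counts the nodes in the subtree rooted at `node`. The parent awaits its
    // children rather than blocking, so the thread it was running on returns
    // to the pool while the children are still in progress.
    public static async Task<int> CountAsync(TreeNode node)
    {
        Task<int>[] childTasks = node.Children
            .Select(child => Task.Run(() => CountAsync(child)))
            .ToArray();

        int[] childCounts = await Task.WhenAll(childTasks);
        return 1 + childCounts.Sum();
    }
}
```

For TreeNode.Build(3, 3) the awaited result of CountAsync is 40. No call sits on a pool thread while waiting for its children, so the pool has no reason to inject extra threads.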

Answer 4

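As per the accepted answer to "How exactly does AsParallel work?":

.AsParallel().ForAll() casts back to IEnumerable before calling .ForAll()

so it creates 1 new thread + N recursive calls (each of which generates a new thread).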
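If the nesting of one ForAll() inside another is the problem, one way to sidestep it is to flatten the tree first and run a single top-level PLINQ query over the result. This is only a sketch and was not proposed in the thread; DirNode is a hypothetical stand-in for DirWithSubDirs, and FlatPlinq, Flatten and VisitAll are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;

// Hypothetical stand-in for DirWithSubDirs with a deterministic shape.
public sealed class DirNode
{
    public string Path { get; }
    public List<DirNode> SubDirs { get; } = new List<DirNode>();

    public DirNode(string path, int width, int depth)
    {
        Path = path;
        if (depth > 0)
            for (int i = 0; i < width; i++)
                SubDirs.Add(new DirNode(path + "\\" + i, width, depth - 1));
    }
}

public static class FlatPlinq
{
    // Lazily yields the root and all of its descendants, depth-first.
    public static IEnumerable<DirNode> Flatten(DirNode root)
    {
        var pending = new Stack<DirNode>();
        pending.Push(root);
        while (pending.Count > 0)
        {
            DirNode current = pending.Pop();
            yield return current;
            foreach (DirNode child in current.SubDirs)
                pending.Push(child);
        }
    }

    // Runs `visit` once per node using one PLINQ query, so no ForAll()
    // call is ever nested inside another. Returns the number of nodes visited.
    public static int VisitAll(DirNode root, Action<DirNode> visit)
    {
        int visited = 0;
        Flatten(root).AsParallel().ForAll(node =>
        {
            visit(node);
            Interlocked.Increment(ref visited);
        });
        return visited;
    }
}
```

FlatPlinq.VisitAll(new DirNode("Root", 4, 4), n => { }) visits all 341 nodes of the tree. The visit delegate may run on several threads at once, so it has to be thread-safe.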

4 answers

Answer 1
45 Accepted 2014-09-20 15:27:02

Answer 2
6 2014-09-18 08:40:52

Answer 3
3 2014-09-24 20:24:35

Answer 4
0
