
Parallel.ForEach slower than normal foreach

I'm playing around with Parallel.ForEach in a C# console application, but can't seem to get it right. I'm creating an array of random numbers, and I have a sequential foreach and a Parallel.ForEach that each find the largest value in the array. With roughly equivalent code in C++, I started to see a payoff from using several threads at around 3M values in the array. But here the Parallel.ForEach is twice as slow even at 100M values. What am I doing wrong?

class Program
{
    static void Main(string[] args)
    {
        dostuff();

    }

    static void dostuff() {
        Console.WriteLine("How large do you want the array to be?");
        int size = int.Parse(Console.ReadLine());

        int[] arr = new int[size];
        Random rand = new Random();
        for (int i = 0; i < size; i++)
        {
            arr[i] = rand.Next(0, int.MaxValue);
        }

        var watchSeq = System.Diagnostics.Stopwatch.StartNew();
        var largestSeq = FindLargestSequentially(arr);
        watchSeq.Stop();
        var elapsedSeq = watchSeq.ElapsedMilliseconds;
        Console.WriteLine("Finished sequential in: " + elapsedSeq + "ms. Largest = " + largestSeq);

        var watchPar = System.Diagnostics.Stopwatch.StartNew();
        var largestPar = FindLargestParallel(arr);
        watchPar.Stop();
        var elapsedPar = watchPar.ElapsedMilliseconds;
        Console.WriteLine("Finished parallel in: " + elapsedPar + "ms Largest = " + largestPar);

        dostuff();
    }

    static int FindLargestSequentially(int[] arr) {
        int largest = arr[0];
        foreach (int i in arr) {
            if (largest < i) {
                largest = i;
            }
        }
        return largest;
    }

    static int FindLargestParallel(int[] arr) {
        int largest = arr[0];
        Parallel.ForEach<int, int>(arr, () => 0, (i, loop, subtotal) =>
        {
            if (i > subtotal)
                subtotal = i;
            return subtotal;
        },
        (finalResult) => {
            Console.WriteLine("Thread finished with result: " + finalResult);
            if (largest < finalResult) largest = finalResult;
        }
        );
        return largest;
    }
}

These are the performance ramifications of having a very small delegate body: the per-element delegate invocation costs more than the comparison it performs.

We can achieve better performance by using partitioning. In this case the body delegate performs work over a high data volume.

static int FindLargestParallelRange(int[] arr)
{
    object locker = new object();
    int largest = arr[0];
    Parallel.ForEach(Partitioner.Create(0, arr.Length), () => arr[0], (range, loop, subtotal) =>
    {
        for (int i = range.Item1; i < range.Item2; i++)
            if (arr[i] > subtotal)
                subtotal = arr[i];
        return subtotal;
    },
    (finalResult) =>
    {
        lock (locker)
            if (largest < finalResult)
                largest = finalResult;
    });
    return largest;
}

Take care to synchronize the localFinally delegate. Also note the need for proper initialization of localInit: () => arr[0] instead of () => 0, so the result stays correct even if every value in the array is negative.

Partitioning with PLINQ:

static int FindLargestPlinqRange(int[] arr)
{
    return Partitioner.Create(0, arr.Length)
        .AsParallel()
        .Select(range =>
        {
            int largest = arr[0];
            for (int i = range.Item1; i < range.Item2; i++)
                if (arr[i] > largest)
                    largest = arr[i];
            return largest;
        })
        .Max();
}

I highly recommend the free book Patterns of Parallel Programming by Stephen Toub.

As the other answerers have mentioned, the action you're trying to perform against each item here is so insignificant that there are a variety of other factors which end up carrying more weight than the actual work you're doing. These may include:

  • JIT optimizations
  • CPU branch prediction
  • I/O (outputting thread results while the timer is running)
  • the cost of invoking delegates
  • the cost of task management
  • the system incorrectly guessing what thread strategy will be optimal
  • memory/cpu caching
  • memory pressure
  • environment (debugging)
  • etc.

Running each approach a single time is not an adequate way to test, because it enables a number of the above factors to weigh more heavily on one iteration than on another. You should start with a more robust benchmarking strategy.
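
As a rough sketch of a more robust strategy (short of a full benchmarking framework), timing can be hardened by warming up the JIT first, forcing a collection between runs, and taking the median of several iterations. The helper name below is a placeholder, not part of the original code:

```csharp
using System;
using System.Diagnostics;

static class Bench
{
    // Runs `action` several times after a warm-up pass and returns the median time in ms.
    public static double MedianMs(Action action, int runs = 5)
    {
        action(); // warm-up: lets the JIT compile the code path before timing

        var times = new double[runs];
        for (int r = 0; r < runs; r++)
        {
            GC.Collect();                   // reduce GC noise between runs
            GC.WaitForPendingFinalizers();

            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();
            times[r] = sw.Elapsed.TotalMilliseconds;
        }
        Array.Sort(times);
        return times[runs / 2];             // median is more robust than a single sample
    }
}
```

Usage would be something like `Bench.MedianMs(() => FindLargestSequentially(arr))` against `Bench.MedianMs(() => FindLargestParallel(arr))`, so both approaches face the same conditions.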

Furthermore, your implementation is actually dangerously incorrect. The documentation specifically says:

The localFinally delegate is invoked once per task to perform a final action on each task's local state. This delegate might be invoked concurrently on multiple tasks; therefore, you must synchronize access to any shared variables.

You have not synchronized your final delegate, so your function is prone to race conditions that would make it produce incorrect results.
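
One minimal way to make the final delegate safe, as a sketch (a plain lock, as in the earlier answer, works just as well), is a compare-and-swap retry loop so no thread's result is lost:

```csharp
using System.Threading;
using System.Threading.Tasks;

static class MaxFinder
{
    public static int FindLargestParallelSafe(int[] arr)
    {
        int largest = arr[0];
        Parallel.ForEach(arr,
            () => int.MinValue,                          // per-task local init: identity for max
            (i, loop, subtotal) => i > subtotal ? i : subtotal,
            finalResult =>
            {
                // Lock-free publish: retry until our value is stored, or another
                // task has already stored something at least as large.
                int current;
                while (finalResult > (current = Volatile.Read(ref largest)))
                {
                    Interlocked.CompareExchange(ref largest, finalResult, current);
                }
            });
        return largest;
    }
}
```

Note the localInit here is `int.MinValue` rather than `arr[0]`; either is a correct identity for a per-task running maximum.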

As in most cases, the best approach here is to take advantage of work done by people smarter than we are. In my testing, the following approach appears to be the fastest overall:

return arr.AsParallel().Max();

The Parallel.ForEach loop should be running slower, because the algorithm as written is not really parallel and a lot more work is being done to run it.

In the single thread, to find the max value, we can take the first number as our max value and compare it to every other number in the array. If one of the numbers is larger than our current max, we swap and continue. This way we access each number in the array once, for a total of N comparisons.

In the Parallel loop above, the algorithm creates overhead because each operation is wrapped inside a function call with a return value. So in addition to doing the comparisons, it incurs the overhead of pushing and popping these calls on the call stack. In addition, since each call depends on the value of the call before it, it needs to run in sequence.

In the Parallel.For loop below, the array is divided into an explicit number of chunks determined by the variable threadNumber. This limits the overhead of function calls to a low number.

Note that for low values, the parallel loop performs slower. However, for 100M, there is a decrease in time elapsed.

static int FindLargestParallel(int[] arr)
{
    var answers = new ConcurrentBag<int>();
    int threadNumber = 4;

    int partitionSize = arr.Length/threadNumber;
    Parallel.For(0, /* starting number */
        threadNumber+1, /* Adding 1 to threadNumber in case array.Length not evenly divisible by threadNumber */
        i =>
        {
            if (i*partitionSize < arr.Length) /* check in case # in array is divisible by # threads */
            {
                var max = arr[i*partitionSize];
                for (var x = i*partitionSize; 
                    x < (i + 1)*partitionSize && x < arr.Length;
                    ++x)
                {
                    if (arr[x] > max)
                        max = arr[x];
                }
                answers.Add(max);
            }
        });

    /* note the shortcut in finding max in the bag */    
    return answers.Max(i=>i);
}

Some thoughts here: In the parallel case, there is thread management logic involved that determines how many threads it wants to use. This thread management logic presumably runs on your main thread. Every time a thread returns with a new maximum value, the management logic kicks in and determines the next work item (the next number to process in your array). I'm pretty sure that this requires some kind of locking. In any case, determining the next item may even cost more than performing the comparison operation itself.

That sounds like an order of magnitude more work (overhead) to me than a single thread that processes one number after the other. In the single-threaded case there are a number of optimizations at play: no bounds checks, the CPU can load data into its first-level cache, etc. It's not clear which of these optimizations apply in the parallel case.

Keep in mind that on a typical desktop machine there are only 2 to 4 physical CPU cores available, so you will never have more than that actually doing work. So if the parallel processing overhead is more than 2-4 times that of a single-threaded operation, the parallel version will inevitably be slower, which is what you are observing.

Have you attempted to run this on a 32 core machine? ;-)

A better solution would be to determine non-overlapping ranges (start + stop index) covering the entire array and let each parallel task process one range. This way, each parallel task can internally do a tight single-threaded loop and only return once the entire range has been processed. You could probably even determine a near-optimal number of ranges based on the number of logical cores of the machine. I haven't tried this, but I'm pretty sure you would see an improvement over the single-threaded case.
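
A rough sketch of that range-splitting idea might look like the following (untested, as noted above; the chunking scheme and names are one possible choice, not a definitive implementation):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;

static class RangeMax
{
    public static int FindLargestByRanges(int[] arr)
    {
        // One range per logical core; each task runs a tight sequential loop.
        int workers = Environment.ProcessorCount;
        int chunk = (arr.Length + workers - 1) / workers;  // ceiling division
        var results = new int[workers];

        Parallel.For(0, workers, w =>
        {
            int start = w * chunk;
            int end = Math.Min(start + chunk, arr.Length);
            int local = int.MinValue;
            for (int i = start; i < end; i++)
                if (arr[i] > local) local = arr[i];
            results[w] = local;  // each task writes only its own slot: no sharing
        });

        return results.Max();   // combine the per-range maxima
    }
}
```

Because each task touches only its own slot of `results`, no locking is needed inside the loop; the only cross-thread combination happens once at the end.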

Try splitting the set into batches and running the batches in parallel, where the number of batches corresponds to your number of CPU cores. I ran some equations 1K, 10K and 1M times using the following methods:

  1. A "for" loop.
  2. A "Parallel.For" from the System.Threading.Tasks lib, across the entire set.
  3. A "Parallel.For" across 4 batches.
  4. A "Parallel.ForEach" from the System.Threading.Tasks lib, across the entire set.
  5. A "Parallel.ForEach" across 4 batches.

Results: (Measured in seconds)

[results table image: timings for each method at 1K, 10K, and 1M iterations]

Conclusion:
Processing batches in parallel using Parallel.ForEach has the best outcome in the cases above 10K records. I believe the batching helps because it utilizes all CPU cores (4 in this example) while also minimizing the threading overhead associated with parallelization.

Here is my code:

    public void ParallelSpeedTest()
    {
        var rnd = new Random(56);
        int range = 1000000;
        int numberOfCores = 4;
        int batchSize = range / numberOfCores;
        int[] rangeIndexes = Enumerable.Range(0, range).ToArray();
        double[] inputs = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
        double[] weights = rangeIndexes.Select(n => rnd.NextDouble()).ToArray();
        double[] outputs = new double[rangeIndexes.Length];

        /// Series "for"...
        var startTimeSeries = DateTime.Now;
        for (var i = 0; i < range; i++)
        {
            outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
        }
        var durationSeries = DateTime.Now - startTimeSeries;

        /// "Parallel.For"...
        var startTimeParallel = DateTime.Now;
        Parallel.For(0, range, (i) => {
            outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
        });
        var durationParallelFor = DateTime.Now - startTimeParallel;

        /// "Parallel.For" in Batches...
        var startTimeParallel2 = DateTime.Now;
        Parallel.For(0, numberOfCores, (c) => {
            var endValue = (c == numberOfCores - 1) ? range : (c + 1) * batchSize;
            var startValue = c * batchSize;
            for (var i = startValue; i < endValue; i++)
            {
                outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
            }
        });
        var durationParallelForBatches = DateTime.Now - startTimeParallel2;

        /// "Parallel.ForEach"...
        var startTimeParallelForEach = DateTime.Now;
        Parallel.ForEach(rangeIndexes, (i) => {
            outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
        });
        var durationParallelForEach = DateTime.Now - startTimeParallelForEach;

        /// Parallel.ForEach in Batches...
        List<Tuple<int,int>> ranges = new List<Tuple<int, int>>();
        for (var i = 0; i < numberOfCores; i++)
        {
            int start = i * batchSize;
            int end = (i == numberOfCores - 1) ? range : (i + 1) * batchSize;
            ranges.Add(new Tuple<int,int>(start, end));
        }
        var startTimeParallelBatches = DateTime.Now;
        Parallel.ForEach(ranges, (range) => {
            for (var i = range.Item1; i < range.Item2; i++) {
                outputs[i] = Math.Sqrt(Math.Pow(inputs[i] * weights[i], 2));
            }
        });
        var durationParallelForEachBatches = DateTime.Now - startTimeParallelBatches;

        Debug.Print($"=================================================================");
        Debug.Print($"Given: Set-size: {range}, number-of-batches: {numberOfCores}, batch-size: {batchSize}");
        Debug.Print($".................................................................");
        Debug.Print($"Series For:               {durationSeries}");
        Debug.Print($"Parallel For:             {durationParallelFor}");
        Debug.Print($"Parallel For Batches:     {durationParallelForBatches}");
        Debug.Print($"Parallel ForEach:         {durationParallelForEach}");
        Debug.Print($"Parallel ForEach Batches: {durationParallelForEachBatches}");
        Debug.Print($"");
    }
