How does a sorting network beat generic sorting algorithms?

Question

In reference to "Fastest sort of fixed length 6 int array", I do not fully understand how this sorting network beats an algorithm like insertion sort.

From that question, here is a comparison of the number of CPU cycles taken to complete the sort:

Linux 32 bits, gcc 4.4.1, Intel Core 2 Quad Q8300, -O2

Insertion Sort (Daniel Stutzbach): 1425

Sorting Networks (Daniel Stutzbach): 1080

The code used is as follows:

Insertion Sort (Daniel Stutzbach)

static inline void sort6_insertion_sort_v2(int *d){
    int i, j;
    for (i = 1; i < 6; i++) {
        int tmp = d[i];
        for (j = i; j >= 1 && tmp < d[j-1]; j--)
            d[j] = d[j-1];
        d[j] = tmp;
    }
}

Sorting Networks (Daniel Stutzbach)

static inline void sort6_sorting_network_v1(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}

I understand that sorting networks are really good for sorting in parallel, because some of the steps are independent of the other steps. But here we are not using the parallelization.

I expect it to be faster, as it has the advantage of knowing the exact number of elements beforehand. Where and why exactly does insertion sort make unnecessary comparisons?

EDIT1:

This is the input set these codes are compared against:

int d[6][6] = {
    {1, 2, 3, 4, 5, 6},
    {6, 5, 4, 3, 2, 1},
    {100, 2, 300, 4, 500, 6},
    {100, 2, 3, 4, 500, 6},
    {1, 200, 3, 4, 5, 600},
    {1, 1, 2, 1, 2, 1}
};

Answer 1

But here we are not using the parallelization.

Modern CPUs can figure out when instructions are independent and will execute them in parallel. Hence, even though there's only one thread, the sorting network's parallelism can be exploited.
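One rough way to see how much independence the network offers is to count how many comparators must run strictly one after another because they touch the same array slot. The sketch below does that for the 12 comparators from the question. `NETWORK` and `network_depth` are names made up for this illustration, and the result says nothing about how many instructions a particular CPU actually overlaps.

```c
#include <stddef.h>

/* The 12 compare-exchange pairs, in the order used by
   sort6_sorting_network_v1 in the question. */
static const int NETWORK[12][2] = {
    {1, 2}, {0, 2}, {0, 1}, {4, 5}, {3, 5}, {3, 4},
    {0, 3}, {1, 4}, {2, 5}, {2, 4}, {1, 3}, {2, 3}
};

/* Hypothetical helper: length of the longest chain of comparators that
   have to run in sequence because each one shares a slot with the
   previous one in the chain. */
static int network_depth(void) {
    int ready[6] = {0, 0, 0, 0, 0, 0};  /* step at which each slot was last written */
    int depth = 0;
    for (size_t k = 0; k < 12; k++) {
        int x = NETWORK[k][0], y = NETWORK[k][1];
        /* A comparator can start only after both of its slots are ready. */
        int step = (ready[x] > ready[y] ? ready[x] : ready[y]) + 1;
        ready[x] = ready[y] = step;
        if (step > depth) depth = step;
    }
    return depth;
}
```

For this network the helper returns 6, so the 12 comparators form a dependency chain only 6 steps deep. The remaining comparators are free to overlap with that chain if the hardware allows it.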

Where exactly does insertion sort make unnecessary comparisons?

The easiest way to see the extra comparisons is to do an example by hand.

Insertion sort:
6 5 4 3 2 1
5 6 4 3 2 1
5 4 6 3 2 1
4 5 6 3 2 1
4 5 3 6 2 1
4 3 5 6 2 1
3 4 5 6 2 1
3 4 5 2 6 1
3 4 2 5 6 1
3 2 4 5 6 1
2 3 4 5 6 1
2 3 4 5 1 6
2 3 4 1 5 6
2 3 1 4 5 6
2 1 3 4 5 6
1 2 3 4 5 6

Sorting network:
6 5 4 3 2 1
6 4 5 3 2 1
5 4 6 3 2 1
4 5 6 3 2 1 # These three can execute in parallel with the first three
4 5 6 3 1 2 #
4 5 6 2 1 3 #
4 5 6 1 2 3
1 5 6 4 2 3
1 2 6 4 5 3
1 2 3 4 5 6
1 2 3 4 5 6
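The hand trace above can be cross-checked by counting element comparisons in code. This is only a sketch: `insertion_sort_comparisons` and `network_comparisons` are illustrative names, the insertion sort is the question's version rewritten with an explicit counter, and the loop-bound test `j >= 1` is deliberately not counted.

```c
/* Runs the question's insertion sort on d[0..5] and returns how many
   element comparisons (tmp < d[j-1]) it performed. */
static int insertion_sort_comparisons(int *d) {
    int count = 0;
    for (int i = 1; i < 6; i++) {
        int tmp = d[i];
        int j = i;
        while (j >= 1) {
            count++;                       /* one element comparison */
            if (!(tmp < d[j - 1])) break;  /* found the insertion point */
            d[j] = d[j - 1];
            j--;
        }
        d[j] = tmp;
    }
    return count;
}

/* Runs the 12-comparator network on d[0..5] and returns how many
   element comparisons it performed (always 12, whatever the input). */
static int network_comparisons(int *d) {
    static const int net[12][2] = {
        {1, 2}, {0, 2}, {0, 1}, {4, 5}, {3, 5}, {3, 4},
        {0, 3}, {1, 4}, {2, 5}, {2, 4}, {1, 3}, {2, 3}
    };
    int count = 0;
    for (int k = 0; k < 12; k++) {
        int x = net[k][0], y = net[k][1];
        count++;
        if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    }
    return count;
}
```

On the reversed input 6 5 4 3 2 1 the insertion sort makes 15 element comparisons against the network's fixed 12. On already sorted input the insertion sort needs only 5, so the network does not win on comparison count for every input.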

Answer 2

The better question is why the sorting network only outperforms insertion sort (generally a very slow sort) by ~50%. The answer is that big-O is not so important when n is tiny. As for OP's question, Daniel has the best answer.

Answer 3

I think loop unrolling is the reason the sorting network algorithm comes out faster.

Answer 4

I believe the amount of 'work' done by a parallel algorithm and a serial algorithm is almost always the same; distributing the work only lets you get the output sooner. I think the output arrives convincingly faster only when the input is large enough to justify using a parallel algorithm.

In the case of insertion sort, the array is divided among processors in such a way that they form a pipeline. It takes some time to fill the pipeline before the benefits of the parallel algorithm appear.

Answer 5

Theoretically, the code could be about the same if the compiler could fully unroll the loops in the insertion sort. The first (outer) loop can be unrolled easily, while the second (inner) loop cannot be unrolled as easily.

It may also be that, because the code is not as simple as the sorting network code, the compiler can make fewer optimizations. I think there are more dependencies in the insertion sort than in the sorting network, which may make a big difference when the compiler tries to optimize the code (correct me if I'm wrong).
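To make the unrolling point concrete, here is a sketch of the question's insertion sort with the outer loop unrolled by hand. The `INSERT` macro and the function name are invented for this illustration. The inner `while` loop stays, because its trip count depends on the data and not on a constant.

```c
/* Hypothetical helper macro: insert d[i] into the already sorted
   prefix d[0..i-1]. The while loop runs a data-dependent number of times. */
#define INSERT(i) do {                          \
        int tmp = d[(i)];                       \
        int j = (i);                            \
        while (j >= 1 && tmp < d[j - 1]) {      \
            d[j] = d[j - 1];                    \
            j--;                                \
        }                                       \
        d[j] = tmp;                             \
    } while (0)

/* Insertion sort of 6 ints with the outer loop (i = 1..5) written out. */
static inline void sort6_insertion_sort_unrolled(int *d) {
    INSERT(1);
    INSERT(2);
    INSERT(3);
    INSERT(4);
    INSERT(5);
}
#undef INSERT
```

Whether a compiler produces something like this on its own at -O2 is not something this sketch shows; it only separates the part with a fixed iteration count from the part without one.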

Answer 6

I think all of your questions are answered in Daniel Stutzbach's answer to the original post:

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.
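Since the cost attributed to comparisons here is really the cost of branches, one common variation is to write each compare-exchange without an explicit `if`. The sketch below applies that to the question's 12-comparator network. The function name is made up for this example, and it is an assumption, not a measured result, that a given compiler turns the conditional expressions into conditional moves; that depends on the compiler, flags and target.

```c
/* Compare-exchange without an explicit if: afterwards d[x] holds the
   smaller of the two values and d[y] the larger. */
#define SWAP(x, y) do {                     \
        int a = d[(x)], b = d[(y)];         \
        d[(x)] = a < b ? a : b;             \
        d[(y)] = a < b ? b : a;             \
    } while (0)

/* Same comparator order as sort6_sorting_network_v1 in the question. */
static inline void sort6_sorting_network_minmax(int *d) {
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
}
#undef SWAP
```

The insertion sort cannot be rewritten this way as directly, because its inner loop has to branch on the comparison result to decide when to stop.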

6 answers

solution1 19 ACCEPTED 2010-10-10 16:34:37

solution2 4 2010-10-10 17:48:45

solution3 1 2010-10-10 16:32:21

solution4 1 2010-10-10 17:00:24

solution5 0 2010-10-10 16:33:11

solution6 0 2010-10-10 16:35:45
