有效实现多线程排序算法的关键是什么？天真的实现无法正常工作

Question

My first version two thread selection sort starts a new thread for every iteration. 我的第一个版本两线程选择排序为每次迭代启动一个新线程。

Some comparison(left - one thread, right - two thread version, time - ms): 一些比较（左-一个线程，右-两个线程版本，时间-ms）：

Size: 500       time 1  time 5
Size: 1000      time 1  time 9
Size: 1500      time 4  time 12
Size: 2000      time 5  time 16
Size: 2500      time 10 time 22
Size: 3000      time 14 time 26
Size: 3500      time 19 time 30
Size: 4000      time 24 time 36
Size: 4500      time 30 time 43
Size: 5000      time 37 time 49
Size: 5500      time 46 time 57
Size: 6000      time 55 time 66
Size: 6500      time 63 time 76
Size: 7000      time 74 time 80
Size: 7500      time 85 time 92
Size: 8000      time 96 time 102
Size: 8500      time 108        time 109
Size: 9000      time 122        time 124
Size: 9500      time 135        time 132
Size: 10000     time 150        time 144
Size: 10500     time 165        time 156
Size: 11000     time 181        time 174
Size: 11500     time 200        time 177
Size: 12000     time 218        time 195
Size: 12500     time 235        time 205
Size: 13000     time 255        time 214
Size: 13500     time 273        time 226
Size: 14000     time 296        time 245

After 9500 size array two threads work faster. 在9500大小的数组之后，两个线程的工作速度更快。 In my second implementation a thread start once. 在我的第二个实现中，线程启动一次。 But it has so unbelievable performance. 但是它具有如此令人难以置信的性能。

My cpu has 4 cores. 我的cpu有4个核心。

Size: 0 time 0  time 0
Size: 50        time 0  time 151
Size: 100       time 0  time 1276
Size: 150       time 0  time 2089
Size: 200       time 0  time 3925
Size: 250       time 0  time 5303

Code: 码：

//one thread
template<class ItType>
void selectionSortThreadsHelper2(ItType beg, ItType end)
{
    //sorting element by element
    for (auto it = beg; it != end; ++it) {

        ItType middleIt = it + std::distance(it, end) / 2;

        auto search = [&] { return std::min_element(it, middleIt); };
        //search
        std::future<ItType> minFirstHalfResult(std::async(std::launch::async , search));

        //wait searching
        ItType minSecondHalfIt = std::min_element(middleIt, end);
        ItType minFirstHalfIt = minFirstHalfResult.get();

        //swap if
        ItType minIt = *minFirstHalfIt < *minSecondHalfIt ? minFirstHalfIt : minSecondHalfIt;
        if (minIt != it)
            std::iter_swap(minIt, it);
    }
}

//two thread
template<class ItType>
void selectionSortThreadsHelper3(ItType beg, ItType end)
{
    bool quit = false;
    bool readyFlag = false;
    bool processed = false;
    std::mutex readyMutex;
    std::condition_variable readyCondVar;

    ItType it;
    ItType middleIt;
    ItType minFirstHalfIt;
    auto search = [&]() {
        while (true) {
            std::unique_lock<std::mutex> ul(readyMutex);
            readyCondVar.wait(ul, [&] {return readyFlag; });

            if (quit)
                return;

            minFirstHalfIt = std::min_element(it, middleIt);

            processed = true;

            ul.unlock();
            readyCondVar.notify_one();
        }
    };

    std::future<void> f(std::async(std::launch::async, search));

    //sorting element by element
    for (it = beg; it != end; ++it) {

        middleIt = it + std::distance(it, end) / 2;

        //say second thread to start searching
        {
            std::lock_guard<std::mutex> lg(readyMutex);
            readyFlag = true;
        }
        readyCondVar.notify_one();
        //std::this_thread::yield();

        ItType minSecondHalfIt = std::min_element(middleIt, end);

        //wait second thread
        {
            std::unique_lock<std::mutex> ul(readyMutex);
            readyCondVar.wait(ul, [&] { return processed; });
            processed = false;
            readyFlag = false;
        }
        readyCondVar.notify_all();

        //swap if
        ItType minIt = *minFirstHalfIt < *minSecondHalfIt ? minFirstHalfIt : minSecondHalfIt;
        if (minIt != it)
            std::iter_swap(minIt, it);
    }

    //quit thread
    {
        std::lock_guard<std::mutex> lg(readyMutex);
        readyFlag = true;
        quit = true;
    }
    readyCondVar.notify_all();

    f.get();
}

Answer 1

There are many ways to get parallel sort without writing your own. 有很多方法可以不用编写自己的代码就可以进行并行排序。 First, there is an experimental parallelism namespace that lets you say sort(par, data.begin(), data.end()) : http://en.cppreference.com/w/cpp/experimental/parallelism/existing#sort 首先，有一个实验性的并行性命名空间，可以让您说sort(par, data.begin(), data.end()) ： http : sort(par, data.begin(), data.end())

That namespace is being merged into the standard in C++17, so it should be in the std:: namespace at some point ( https://parallelstl.codeplex.com/ ). 该名称空间已被合并为C ++ 17中的标准，因此在某些时候它应该位于std::名称空间中（ https://parallelstl.codeplex.com/ ）。 There is also an older nonstandard GNU g++ parallel sort implementation based on OpenMP: https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html 还有一个基于OpenMP的较旧的非标准GNU g ++并行排序实现： https : //gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode.html

Finally, there are many pages online describing how to write your own parallel sort in C++11. 最后，在线上有很多页面描述如何在C ++ 11中编写自己的并行排序。 Try searching. 尝试搜索。 Here is a very comprehensive page: https://software.intel.com/en-us/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp 这是一个非常全面的页面： https : //software.intel.com/zh-cn/articles/a-parallel-stable-sort-using-c11-for-tbb-cilk-plus-and-openmp

Answer 2

Example multi-threaded bottom up merge sort, using Windows threading interface, in this case 4 threads meant for a processor with 4 (or more) cores. 使用Windows线程接口的示例多线程自底向上合并排序，在这种情况下，4个线程用于具有4个（或更多）内核的处理器。 Depending on the size of the array, it's about 3 times as fast as a single threaded merge sort, mostly due to the operations that occur within each core's local L1 and L2 cache. 根据数组的大小，它的速度大约是单线程合并排序的3倍，这主要是由于每个内核的本地L1和L2缓存中发生的操作所致。 The semaphores are used to start all threads at the same time for benchmarking purposes. 信号量用于同时启动所有线程以进行基准测试。 On my system (Intel 2600K 3.4ghz), it takes about 0.5 seconds to sort 16 million 32 bit integers, versus about 1.5 seconds for a single threaded merge sort. 在我的系统（Intel 2600K 3.4ghz）上，排序1600万个32位整数大约需要0.5秒，而单线程合并排序大约需要1.5秒。

#include <cstdlib>
#include <ctime>
#include <iostream>
#include <windows.h>

#define SIZE (16*1024*1024)             // must be multiple of 4

static HANDLE hs0;                      // semaphore handles
static HANDLE hs1;
static HANDLE hs2;
static HANDLE hs3;
static HANDLE ht1;                      // thread handles
static HANDLE ht2;
static HANDLE ht3;

static DWORD WINAPI Thread0(LPVOID);    // thread functions
static DWORD WINAPI Thread1(LPVOID);
static DWORD WINAPI Thread2(LPVOID);
static DWORD WINAPI Thread3(LPVOID);

static int  *pa;                        // pointers to buffers
static int  *pb;

void BottomUpMergeSort(int a[], int b[], size_t n);
void BottomUpMerge(int a[], int b[], size_t ll, size_t rr, size_t ee);
void BottomUpCopy(int a[], int b[], size_t ll, size_t rr);
size_t GetPassCount(size_t n);

int main()
{
int *array = new int[SIZE];
int *buffer = new int[SIZE];
clock_t ctTimeStart;                    // clock values
clock_t ctTimeStop;
    pa = array;
    pb = buffer;
    for(int i = 0; i < SIZE; i++){      // generate pseudo random data
        int r;
        r  = (((int)((rand()>>4) & 0xff))<< 0);
        r += (((int)((rand()>>4) & 0xff))<< 8);
        r += (((int)((rand()>>4) & 0xff))<<16);
        r += (((int)((rand()>>4) & 0x7f))<<24);
        array[i] = r;
    }

    hs0 = CreateSemaphore(NULL,0,1,NULL);
    hs1 = CreateSemaphore(NULL,0,1,NULL);
    hs2 = CreateSemaphore(NULL,0,1,NULL);
    hs3 = CreateSemaphore(NULL,0,1,NULL);
    ht1 = CreateThread(NULL, 0, Thread1, 0, 0, 0);
    ht2 = CreateThread(NULL, 0, Thread2, 0, 0, 0);
    ht3 = CreateThread(NULL, 0, Thread3, 0, 0, 0);

    ctTimeStart = clock();
    ReleaseSemaphore(hs0, 1, NULL);     // start sorts
    ReleaseSemaphore(hs1, 1, NULL);
    ReleaseSemaphore(hs2, 1, NULL);
    ReleaseSemaphore(hs3, 1, NULL);
    Thread0((LPVOID)NULL);
    WaitForSingleObject(ht2, INFINITE);
    // merge 1st and 2nd halves
    BottomUpMerge(pb, pa, 0, SIZE>>1, SIZE);
    ctTimeStop = clock();
    std::cout << "Number of ticks " << (ctTimeStop - ctTimeStart) << std::endl;

    for(int i = 1; i < SIZE; i++){      // check result 
        if(array[i-1] > array[i]){
            std::cout << "failed" << std::endl;
        }
    }
    CloseHandle(ht3);
    CloseHandle(ht2);
    CloseHandle(ht1);
    CloseHandle(hs3);
    CloseHandle(hs2);
    CloseHandle(hs1);
    CloseHandle(hs0);
    delete[] buffer;
    delete[] array;
    return 0;
}

static DWORD WINAPI Thread0(LPVOID lpvoid)
{
    WaitForSingleObject(hs0, INFINITE); // wait for semaphore
    // sort 1st quarter
    BottomUpMergeSort(pa + 0*(SIZE>>2), pb + 0*(SIZE>>2), SIZE>>2);
    WaitForSingleObject(ht1, INFINITE); // wait for thead 1
    // merge 1st and 2nd quarter
    BottomUpMerge(pa + 0*(SIZE>>1), pb + 0*(SIZE>>1), 0, SIZE>>2, SIZE>>1);
    return 0;
}

static DWORD WINAPI Thread1(LPVOID lpvoid)
{
    WaitForSingleObject(hs1, INFINITE); // wait for semaphore
    // sort 2nd quarter
    BottomUpMergeSort(pa + 1*(SIZE>>2), pb + 1*(SIZE>>2), SIZE>>2);
    return 0;
}

static DWORD WINAPI Thread2(LPVOID lpvoid)
{
    WaitForSingleObject(hs2, INFINITE); // wait for semaphore
    // sort 3rd quarter
    BottomUpMergeSort(pa + 2*(SIZE>>2), pb + 2*(SIZE>>2), SIZE>>2);
    WaitForSingleObject(ht3, INFINITE); // wait for thread 3
    // merge 3rd and 4th quarter
    BottomUpMerge(pa + 1*(SIZE>>1), pb + 1*(SIZE>>1), 0, SIZE>>2, SIZE>>1);
    return 0;
}

static DWORD WINAPI Thread3(LPVOID lpvoid)
{
    WaitForSingleObject(hs3, INFINITE); // wait for semaphore
    // sort 4th quarter
    BottomUpMergeSort(pa + 3*(SIZE>>2), pb + 3*(SIZE>>2), SIZE>>2);
    return 0;
}

void BottomUpMergeSort(int a[], int b[], size_t n)
{
size_t s = 1;                               // run size 
    if(GetPassCount(n) & 1){                // if odd number of passes
        for(s = 1; s < n; s += 2)           // swap in place for 1st pass
            if(a[s] < a[s-1])
                std::swap(a[s], a[s-1]);
        s = 2;
    }
    while(s < n){                           // while not done
        size_t ee = 0;                      // reset end index
        while(ee < n){                      // merge pairs of runs
            size_t ll = ee;                 // ll = start of left  run
            size_t rr = ll+s;               // rr = start of right run
            if(rr >= n){                    // if only left run
                rr = n;
                BottomUpCopy(a, b, ll, rr); //   copy left run
                break;                      //   end of pass
            }
            ee = rr+s;                      // ee = end of right run
            if(ee > n)
                ee = n;
            BottomUpMerge(a, b, ll, rr, ee);
        }
        std::swap(a, b);                    // swap a and b
        s <<= 1;                            // double the run size
    }
}

void BottomUpMerge(int a[], int b[], size_t ll, size_t rr, size_t ee)
{
    size_t o = ll;                          // b[]       index
    size_t l = ll;                          // a[] left  index
    size_t r = rr;                          // a[] right index
    while(1){                               // merge data
        if(a[l] <= a[r]){                   // if a[l] <= a[r]
            b[o++] = a[l++];                //   copy a[l]
            if(l < rr)                      //   if not end of left run
                continue;                   //     continue (back to while)
            do                              //   else copy rest of right run
                b[o++] = a[r++];
            while(r < ee);
            break;                          //     and return
        } else {                            // else a[l] > a[r]
            b[o++] = a[r++];                //   copy a[r]
            if(r < ee)                      //   if not end of right run
                continue;                   //     continue (back to while)
            do                              //   else copy rest of left run
                b[o++] = a[l++];
            while(l < rr);
            break;                          //     and return
        }
    }
}

void BottomUpCopy(int a[], int b[], size_t ll, size_t rr)
{
    do                                      // copy left run
        b[ll] = a[ll];
    while(++ll < rr);
}

size_t GetPassCount(size_t n)               // return # passes
{
    size_t i = 0;
    for(size_t s = 1; s < n; s <<= 1)
        i += 1;
    return(i);
}

有效实现多线程排序算法的关键是什么？天真的实现无法正常工作

问题描述

2 个解决方案

解决方案1
2 2016-07-10 18:59:30

解决方案2
2 2016-07-10 19:49:03

有效实现多线程排序算法的关键是什么？ 天真的实现无法正常工作

问题描述

2 个解决方案

解决方案1 2 2016-07-10 18:59:30

解决方案2 2 2016-07-10 19:49:03

有效实现多线程排序算法的关键是什么？天真的实现无法正常工作

解决方案1
2 2016-07-10 18:59:30

解决方案2
2 2016-07-10 19:49:03