Mergesort pThread implementation takes the same time as single-threaded

Question

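(I have simplified this as much as I could to find out what I'm doing wrong.)

The idea of the code is that I have a global array *v (I hope using it doesn't slow things down; the threads should never access the same values, since they all work on different ranges), and I try to create 2 threads, each of which sorts the first or the second half by calling merge_sort() with the corresponding arguments.

In the threaded run I see the process reach 80-100% CPU usage (on a dual-core CPU), while in the non-threaded run it stays at only 50%, yet the run times are very close.

This is the (relevant) code:

//These are the 2 sorting functions; each thread will call merge_sort(..). Is it a problem that both threads call the same (normal) function?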

void merge (int *v, int start, int middle, int end) {
    //dynamically creates 2 new arrays for the v[start..middle] and v[middle+1..end]
    //copies the original values into the 2 halves
    //then sorts them back into the v array
}

void merge_sort (int *v, int start, int end) {
    //recursively calls merge_sort(start, (start+end)/2) and merge_sort((start+end)/2+1, end) to sort them
    //calls merge(start, middle, end) 
}

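//Here I expect each thread to be created and to call merge_sort on its specific range (this is a simplified version of the original code, to make the bug easier to spot)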

void* mergesort_t2(void * arg) {
    t_data* th_info = (t_data*)arg;
    merge_sort(v, th_info->a, th_info->b);
    return (void*)0;
}

//In main, I just create 2 threads that call the function above

int main (int argc, char* argv[])
{
    //some stuff

    //getting the clock to calculate run time
    clock_t t_inceput, t_sfarsit;
    t_inceput = clock();

    //ignore crt_depth for this example (in the full code i'm recursively creating new threads and i need this to know when to stop)
    //the a and b are the range of values the created thread will have to sort
    pthread_t thread[2];
    t_data next_info[2];
    next_info[0].crt_depth = 1;
    next_info[0].a = 0;
    next_info[0].b = n/2;
    next_info[1].crt_depth = 1;
    next_info[1].a = n/2+1;
    next_info[1].b = n-1;

    for (int i=0; i<2; i++) {
        if (pthread_create (&thread[i], NULL, &mergesort_t2, &next_info[i]) != 0) {
            cerr<<"error\n";
            return err;
        }
    }

    for (int i=0; i<2; i++) {
        if (pthread_join(thread[i], &status) != 0) {
            cerr<<"error\n";
            return err;
        }
    }

    //now i merge the 2 sorted halves
    merge(v, 0, n/2, n-1);

    //calculate end time
    t_sfarsit = clock();

    cout<<"Sort time (s): "<<double(t_sfarsit - t_inceput)/CLOCKS_PER_SEC<<endl;
    delete [] v;
}

Output (1 million values):

Sort time (s): 1.294

Output when calling merge_sort directly, without threads:

Sort time (s): 1.388

Output (10 million values):

Sort time (s): 12.75

Output when calling merge_sort directly, without threads:

Sort time (s): 13.838

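Solution:

I would also like to thank WhozCraig and Adam, since they hinted at this from the beginning.

I used the inplace_merge(..) function instead of my own merge function, and the program's run times are now what they should be.

This is my initial merge function (I'm not sure it really is the initial one; I have probably modified it a few times since, and the array indices might be wrong right now because I went back and forth between [a,b] and [a,b). This is just the last commented-out version):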

void merge (int *v, int a, int m, int c) { //sorts v[a,m] - v[m+1,c] in v[a,c]

    //create the 2 new arrays
    int *st = new int[m-a+1];
    int *dr = new int[c-m+1];
    //copy the values
    for (int i1 = 0; i1 <= m-a; i1++)
        st[i1] = v[a+i1];
    for (int i2 = 0; i2 <= c-(m+1); i2++)
        dr[i2] = v[m+1+i2];

    //merge them back together in sorted order
    int is=0, id=0;
    for (int i=0; i<=c-a; i++)  {
        if (id+m+1 > c || (a+is <= m && st[is] <= dr[id])) {
            v[a+i] = st[is];
            is++;
        }
        else {
            v[a+i] = dr[id];
            id++;
        }
    }
    //arrays allocated with new[] must be released with delete[], one statement per array
    delete[] st;
    delete[] dr;
}

All of this was replaced with:

inplace_merge(v+a, v+m, v+c);
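Since the post mentions switching between [a,b] and [a,b) conventions, here is a minimal sketch of how the inclusive bounds used by merge(v, a, m, c) could map onto std::inplace_merge, which takes half-open iterator ranges [first, middle) and [middle, last). The wrapper name merge_inclusive is hypothetical and not part of the original code.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical wrapper: merges the sorted inclusive ranges v[a..m] and
// v[m+1..c] in place. std::inplace_merge expects half-open ranges
// [first, middle) and [middle, last), so the inclusive middle and end
// indices are each shifted by one.
void merge_inclusive(int *v, int a, int m, int c) {
    assert(a <= m && m <= c);
    std::inplace_merge(v + a, v + m + 1, v + c + 1);
}
```

With this convention the final step in main would read merge_inclusive(v, 0, n/2, n-1). Note that std::inplace_merge may still try to obtain a temporary buffer internally; it just does so once per call instead of through two separate new[] allocations.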

Edit: on my 3 GHz dual-core CPU:

1 million values: 1 thread: 7.236 s; 2 threads: 4.622 s; 4 threads: 4.692 s

10 million values: 1 thread: 82.034 s; 2 threads: 46.189 s; 4 threads: 47.36 s

Answer 1

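Note: since the OP uses Windows, my answer below (which incorrectly assumed Linux) might not apply. I am leaving it in place for those who might find the information useful.

clock() is the wrong interface for measuring time on Linux: it measures the CPU time used by the program (see http://linux.die.net/man/3/clock ), and with multiple threads that is the sum of the CPU time of all threads. You need to measure elapsed, or wall-clock, time instead. See this SO question for more details: "C: using clock() to measure time in multi-threaded programs", which also tells you which API can be used instead of clock().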
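To make the difference visible, a minimal sketch that reports both numbers side by side could look like the following. It uses std::chrono::steady_clock for elapsed time, which is one possible replacement for clock() and an assumption on my part (the linked question may recommend a different API); the names Timing and time_work are hypothetical.

```cpp
#include <cassert>
#include <chrono>
#include <ctime>
#include <thread>

// Hypothetical result type holding both measurements.
struct Timing {
    double wall_seconds;  // elapsed real time from steady_clock
    double cpu_seconds;   // what clock() reports
};

// Runs work() once and measures it with both clocks.
template <typename Work>
Timing time_work(Work work) {
    const auto wall_start = std::chrono::steady_clock::now();
    const std::clock_t cpu_start = std::clock();
    work();
    const std::clock_t cpu_end = std::clock();
    const auto wall_end = std::chrono::steady_clock::now();
    const double wall =
        std::chrono::duration<double>(wall_end - wall_start).count();
    assert(wall >= 0.0);  // steady_clock is monotonic
    return {wall, double(cpu_end - cpu_start) / CLOCKS_PER_SEC};
}
```

Wrapping the create/join/merge section of main in time_work would show whether the two-thread run is actually faster in elapsed time even when the clock() figure stays about the same.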

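In the MPI-based implementation you are trying to compare against, two different processes are used (that is how MPI typically enables concurrency), and the CPU time of the second process is not included, so the CPU time is close to the wall-clock time. Nevertheless, using CPU time (and clock()) for performance measurement is still wrong, even in serial programs. For one thing, if the program waits for, say, a network event or a message from another MPI process, it still spends time, but not CPU time.

Update: in the implementation in Microsoft's C run-time library, clock() returns wall-clock time, so it is fine for your purpose. It is still unclear, though, whether you are using Microsoft's toolchain or something else, such as Cygwin or MinGW.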

Answer 2

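One thing strikes me: "dynamically creates 2 new arrays [...]". Since both threads need memory from the system, they have to acquire a lock for it, which could well be your bottleneck. In particular, the idea of doing microscopic array allocations sounds horribly inefficient. Someone suggested an in-place sort that doesn't need any additional storage, which is much better for performance.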
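If a fully in-place merge is not wanted, another way to keep allocation out of the hot path is to allocate one scratch buffer up front and reuse it for every merge. The following is only a sketch under that assumption, not the OP's code; all three function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Merges sorted v[a..m] and v[m+1..c] (inclusive bounds) using scratch
// space supplied by the caller, so no allocation happens per merge.
void merge_with_scratch(int *v, int *scratch, int a, int m, int c) {
    assert(a <= m && m <= c);
    std::copy(v + a, v + c + 1, scratch + a);
    std::merge(scratch + a, scratch + m + 1,      // left half
               scratch + m + 1, scratch + c + 1,  // right half
               v + a);                            // written back into v
}

// Recursive merge sort over the inclusive range v[a..c].
void merge_sort_scratch(int *v, int *scratch, int a, int c) {
    if (a >= c) return;
    const int m = a + (c - a) / 2;
    merge_sort_scratch(v, scratch, a, m);
    merge_sort_scratch(v, scratch, m + 1, c);
    merge_with_scratch(v, scratch, a, m, c);
}

// Allocates the scratch buffer exactly once for the whole sort.
void merge_sort_once_allocated(std::vector<int> &values) {
    if (values.size() < 2) return;
    std::vector<int> scratch(values.size());
    merge_sort_scratch(values.data(), scratch.data(), 0,
                       int(values.size()) - 1);
}
```

Because each merge only touches scratch[a..c], two threads sorting disjoint halves of v could share the same scratch array without locking, as long as the final merge runs after both have been joined.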

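Another thing is the often-forgotten opening half-sentence of any big-O complexity statement: "There is an n0 such that for all n > n0 ...". In other words, maybe you haven't reached n0 yet? I recently watched a video (hopefully someone else will remember it) in which some people tried to determine this limit for certain algorithms, and the limits turned out to be surprisingly high.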
