How do I count word occurrences in C using pthreads?

Question

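I wrote a program in C that counts the occurrences of every word in a file and sorts them, so that the words are displayed from most frequent to least frequent. However, I need to use pthreads to create multiple threads, and the number of threads depends on a number entered as a command-line argument. The file needs to be divided among that many threads. For example, say 4 is entered as the argument on the command line. The file then needs to be split into four parts, each handled by a new thread. Afterwards, the four parts need to be joined back together. My C is not very good and I don't know how to do this. Can anyone help? An example would be great.

Here is my code so far: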

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
       struct stat fileStat;
       FILE *out;
       char *address;
       int size, res, file, num_threads;

       // list_t: word-list type defined elsewhere (not shown).
       list_t *words = (list_t *)malloc(sizeof(list_t));

       // Expect a file name and a thread count.
       if (argc < 3) {
          exit(1);
       }

       res = access(argv[1], F_OK);
       if (res != 0) {
          exit(1);
       }

       stat(argv[1], &fileStat);

       // Check if a file.
       if (S_ISREG(fileStat.st_mode)) {
          file = open(argv[1], O_RDONLY);
          if (file < 0)
             exit(1);
          // Check the total size of the file
          size = fileStat.st_size;
          num_threads = atoi(argv[2]);
          if ((address = mmap(0, size, PROT_READ, MAP_SHARED, file, 0)) == MAP_FAILED) {
             exit(1);
          }
          munmap(address, size);
          close(file);
       } else {
          exit(1);
       }

       return 0;
    }

Answer 1

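Multiple threads can safely read the source file without any problems. Writing is where you run into trouble.

My suggestion (without really understanding the requirements) is: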

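Determine the file size at startup
Compute the value size / number of threads
Say the file size is 4k; each thread then gets a value of about 1k
Seek 1 chunk size into the file and read single bytes until you find a word separator
This position is the end of thread 1's region and the start of thread 2's
Seek to 2 and 3 chunk sizes and do the same
At this point each thread has a start and an end position in the file
Start each thread and pass it the positions it is responsible for
Make the hash table (or whatever method you use for counting the words) safe with mutual exclusion, and simply have each thread increment the count of any word it finds
Once all threads are done, you have your list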

Answer 2

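The idea here is that dividing the work among multiple threads and then joining the parts afterwards gets the same job done much faster. So you need to:

Divide the work into many parts without wasting too much time
Find an easy way to join the work back together
Solve the boundary problems caused by splitting the work

The first part is easy. Just divide the data evenly among the threads.

The second part is easy too. Just sum the results.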

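The tricky part is part 3. In your case, you may end up with a word split between two different threads. So, to avoid counting "half words", you have to keep a separate record of the first and last word of each thread. Then, once you have all the results, you can take the last word of thread N, concatenate it with the first word of thread N + 1, and only then add that word to the count. Obviously, if a separator (space, newline, ...) is the first or last character a thread finds, its respective first or last word will be empty.

In pseudocode: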

def main:
    size = filesize
    ptr = mmap(file)
    num_threads = 4

    # Thread i gets the i-th slice, starting with slice 0.
    for i in range(0, num_threads):
        new_thread(exec = count_words,
                   start = ptr + i * (size / num_threads),
                   length = size / num_threads)  # last thread: add size % num_threads

    wait_for_result(all_threads)
    join_the_results

def count_words(start, length):
    # Count words as if the segment were an entire file,
    # but store the first/last word separately if the
    # segment does not start/end with a word
    # separator (" ", ".", ",", "\n", etc.)
    return (count_of_words, first_word, last_word)

This is the same idea behind MapReduce.

Answer 3

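The logic in this code is not perfect. I used C++. If you strongly prefer C, you can use POSIX threads instead of std::thread. Also, I simply divided the total file size by the number of threads. You will have to handle the last chunk of data (the remainder left over after dividing by the number of threads) in the last thread itself. I haven't done that yet.

Another point is how I get the return values from the threads. For now I save them to a global array. C++11 supports retrieving return values; see "C++: Simple return value from std::thread?"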

#include <fstream>
#include <iostream>
#include <iterator>
#include <mutex>
#include <thread>
#include <vector>
using namespace std;

#define NO_OF_THREADS 4

int countArray[NO_OF_THREADS];
std::mutex g_pages_mutex;

// Return the number of leading spaces in arr[0..len).
int trimWhiteSpaces(const char *arr, int len)
{
    int i = 0;
    while(i < len && arr[i] == ' ')
        i++;
    return i;
}

// Count the space-separated words (or word fragments) in arr[0..len).
void getWordCount(const char *arr, int len, int ID)
{
    int count = 0;
    int i = trimWhiteSpaces(arr, len);
    while(i < len)
    {
        count++;                                    // a word starts here
        while(i < len && arr[i] != ' ')             // skip over the word
            i++;
        i = i + trimWhiteSpaces(&arr[i], len - i);  // skip the spaces after it
    }

    // The mutex keeps the output of different threads from interleaving.
    g_pages_mutex.lock();
    cout << "MYCOUNT:" << count << "\n";
    countArray[ID] = count;
    g_pages_mutex.unlock();
}

int main(int argc, const char * argv[])
{
    std::thread threadIDs[NO_OF_THREADS];
    const char *filePath = "/Users/abc/Desktop/test.txt";
    int decrements = 0;

    // Read the whole file once, so that each thread works on its own
    // stable slice instead of a buffer that is overwritten per chunk.
    std::ifstream is(filePath, std::ifstream::binary);
    if(!is)
    {
        cerr << "Cannot open " << filePath << "\n";
        return 1;
    }
    std::vector<char> fileData((std::istreambuf_iterator<char>(is)),
                               std::istreambuf_iterator<char>());
    int fileSize = (int)fileData.size();
    int bulkSize = fileSize / NO_OF_THREADS;  // remainder bytes are not handled

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        const char *chunk = fileData.data() + iter * bulkSize;
        // A word split across two chunks is counted by both threads,
        // so subtract one for every boundary that falls inside a word.
        if(iter > 0 && bulkSize > 0 && chunk[-1] != ' ' && chunk[0] != ' ')
        {
            decrements = decrements + 1;
        }
        threadIDs[iter] = std::thread(getWordCount, chunk, bulkSize, iter);
    }

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        threadIDs[iter].join();
    }

    int totalCount = 0;
    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        cout << "COUNT: " << countArray[iter] << "\n";
        totalCount = totalCount + countArray[iter];
    }

    cout << "TOTAL: " << totalCount - decrements << "\n";
    return 0;
}


Answer 1
1 2014-12-05 21:20:04

Answer 2
0 accepted 2014-12-11 16:59:49

Answer 3
0 2014-12-11 17:17:41
