
How to use pthread in C to count the number of word occurrences?

I have written a program in C to count all the word occurrences of each word in a file and sort them to display the most occurring words to the least occurring words. However, I need to use pthread to create multiple threads, depending on the number entered as an argument in the command line. The file needs to be split up into the number of threads entered. For example, say 4 was entered in the command line as an argument. The file would then need to be split up into four parts with each part using a new thread. Then the four parts would need to be joined back together. My C is not very good and I am lost on how to do this. Can anyone please help with this? An example would be great.

Here is my code so far:

    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(int argc, char **argv) {
       struct stat fileStat;
       FILE *out;
       char *addr;
       int size, res, file, num_threads;

       list_t *words = (list_t *)malloc(sizeof(list_t));

       res = access(argv[1], F_OK);
       if (res != 0) {
          exit(1);
       }

       stat(argv[1], &fileStat);

       // Check that it is a regular file.
       if (S_ISREG(fileStat.st_mode)) {
          file = open(argv[1], O_RDONLY);
          if (file < 0)
             exit(1);
          // Check the total size of the file.
          size = fileStat.st_size;
          num_threads = atoi(argv[2]);
          if ((addr = mmap(0, size, PROT_READ, MAP_SHARED, file, 0)) == MAP_FAILED) {
             exit(1);
          }
          munmap(addr, size);
          close(file);
       } else {
          exit(1);
       }

Multiple threads can safely read the same source file without issue; it is writing to shared data where you run into problems.

My suggestion (without really understanding the requirements) is:

  • On launch, determine the file size
  • Calculate size / thread count
  • Say the file is 4k in size: you get a value of about 1k per thread
  • Seek one chunk size into the file, then read single bytes until you find a word separator
  • That position is the end of thread 1's zone and the start of thread 2's
  • Seek to the second and third chunk boundaries and do the same
  • At this point you have a start and finish position in the file for each thread
  • Launch each thread and pass it the positions it is responsible for covering
  • Make the hashtable (or whatever method you are using for counting words) thread-safe with mutual exclusion, and just have each thread add to the count of whatever word it finds
  • Once all threads are done, you have your list

The idea here is that by dividing the work among multiple threads and joining the partial results afterwards, the same operation runs much faster. So you need to:

  1. Divide the work into many parts without wasting too much time
  2. Find a way of easily joining the partial results back together
  3. Handle the problems at the boundaries created by dividing the work

The first part is simple. Just divide your data equally between threads.

The second part is easy too. Just sum the results.

The tricky part is number 3. In your case, you could end up with a word divided between two different threads. To avoid counting "half-words", each thread must keep a separate record of its first and last word. Then, when you have all your results, you can take the last word of thread N, concatenate it with the first word of thread N+1, and only then add that word to the count. Obviously, if a separator (space, newline, ...) is the first/last character seen by a thread, its respective first/last word will be empty.

In pseudo-code:

def main:
    size = filesize
    ptr = mmap(file)
    num_threads = 4

    for i in range(0, num_threads):
        new_thread(exec = count_words,
                   start = ptr + i * size / num_threads,
                   length = size / num_threads)

    wait_for_result(all_threads)
    join_the_results

def count_words(start, length):
    # Count words as if the segment were an entire file,
    # but store the first/last word separately if the
    # segment does not start/end with a word
    # separator (" ", ".", ",", "\n", etc...)
    return (count_of_words, first_word, last_word)

This is the same idea behind MapReduce.

This is not perfect-logic code. I have used C++; if you are particular about C, you can use POSIX threads instead of std::thread. Also, I have simply divided the whole file size by the number of threads. You will have to take care of the last chunk of data (the remainder left over from the division) in the last thread itself; I haven't done this.

Another point is the way I am getting the return values back from the threads. As of now I am saving them to a global array. C++11 supports retrieving return values directly; see C++: Simple return value from std::thread?

#include <iostream>
#include <fstream>
#include <cstring>   // strlen
#include <thread>
#include <mutex>
using namespace std;

#define NO_OF_THREADS 4

int countArray[100];
std::mutex g_pages_mutex;

// Return the number of leading spaces in arr[0..len).
int trimWhiteSpaces(char *arr, int len)
{
    int i = 0;
    for(; i < len; i++)
    {
        if(arr[i] != ' ')
            break;
    }

    return i;
}

void getWordCount(char *arr, int len, int ID)
{
    int count = 0;
    bool isSpace = false;
    int i = trimWhiteSpaces(arr, len);
    for(; i < len; i++)
    {
        if(arr[i] == ' ')
        {
            // Skip runs of spaces so they count as a single separator.
            i = i + trimWhiteSpaces(&arr[i], len - i) - 1;
            isSpace = true;
            count++;
        }
        else
        {
            isSpace = false;
        }
    }

    if(isSpace)        // trailing spaces do not end an extra word
        count = count - 1;
    count = count + 1; // the last word has no trailing separator
    g_pages_mutex.lock();
    cout << "MYCOUNT:" << count << "\n";
    countArray[ID] = count;
    g_pages_mutex.unlock();
}

int main(int argc, const char * argv[])
{
    // One buffer per thread: each thread reads its chunk while main keeps
    // looping, so the threads must not all share a single buffer.
    char fileData[NO_OF_THREADS][5000];
    std::thread threadIDs[NO_OF_THREADS];
    const char *filePath = "/Users/abc/Desktop/test.txt";
    int read_sz = 0;
    int decrements = 0;
    bool previousNotEndsInSpace = false;

    std::ifstream is(filePath, std::ifstream::ate | std::ifstream::binary);
    int fileSize = is.tellg();
    int bulkSize = fileSize / NO_OF_THREADS;
    is.seekg(0);

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        is.read(fileData[iter], bulkSize);
        int chunk_len = (int)is.gcount();   // bytes actually read this time
        fileData[iter][chunk_len] = '\0';
        if(chunk_len > 0)
        {
            read_sz += chunk_len;
            cout << " data size so far: " << read_sz << "\n";
            cout << fileData[iter] << endl;
            // A word split across two chunks gets counted once in each;
            // remember it so the total can be corrected at the end.
            if(previousNotEndsInSpace && fileData[iter][0] != ' ')
            {
                decrements = decrements + 1;
            }

            previousNotEndsInSpace = (fileData[iter][chunk_len - 1] != ' ');

            threadIDs[iter] = std::thread(getWordCount, fileData[iter],
                                          chunk_len, iter);
        }
    }

    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        threadIDs[iter].join();
    }

    int totalCount = 0;
    for(int iter = 0; iter < NO_OF_THREADS; iter++)
    {
        cout << "COUNT: " << countArray[iter] << "\n";
        totalCount = totalCount + countArray[iter];
    }

    cout << "TOTAL: " << totalCount - decrements << "\n";
    return 0;
}
