I have written a program in C to count all the word occurrences of each word in a file and sort them to display the most occurring words to the least occurring words. However, I need to use pthread to create multiple threads, depending on the number entered as an argument in the command line. The file needs to be split up into the number of threads entered. For example, say 4 was entered in the command line as an argument. The file would then need to be split up into four parts with each part using a new thread. Then the four parts would need to be joined back together. My C is not very good and I am lost on how to do this. Can anyone please help with this? An example would be great.
Here is my code so far:
int main(int argc, char **argv) {
struct stat fileStat;
FILE *out;
char *address;
int size, res, file, num_threads;
list_t *words = (list_t *)malloc(sizeof(list_t));
res = access(argv[1], F_OK);
if (result != 0) {
exit(1);
}
stat(argv[1], &fileStat);
// Check if a file.
if (S_ISREG(fileStat.st_mode)) {
file = open(argv[1], O_RDONLY);
if (file < 0)
exit(1);
// Check the total size of the file
size = fileStat.st_size;
num_threads = atoi(argv[2]);
if ((addr = mmap(0, size, PROT_READ, MAP_SHARED , file, 0)) == (void *) -1) {
exit(1);
}
munmap(addr, size);
close(file);
} else {
exit(1);
}
Multiple threads can safely read a source file without issue. Writing is when you have problems.
My suggestion (without really understanding the requirements) is:
The idea here is that dividing the work in multiple threads and joining the parts afterwards, it is much faster to do the same operation. So you need to:
The first part is simple. Just divide your data equally between threads.
The second part is easy too. Just sum the results.
The tricky part is part number 3. In your case, you could end up with a word divided between two different threads. So to avoid counting "half-words" you must maintain an separate record for the first/last word of every thread. Then, when you have all your results, you can get the last word of thread N and concatenate it with the first word of thread N+1 and only then add the word to the count. Obviusly if an separator(space, enter, ...) is the first/last char found by a thread, your respective first/last word will be empty.
In pseudo-code:
def main:
size = filesize
ptr = mmap(file)
num_threads = 4
for i in range(1, num_threads):
new_thread(exec = count_words,
start = ptr + i * size / num_threads,
length = size / num_threads)
wait_for_result(all_threads)
join_the_results
def count_words(start, length):
# Count words as if it was an entire file
# But store separatelly the first/last word if
# the segment does not start/ends with an word
# separator(" ", ".", ",", "\n", etc...)
return (count_of_words, first_word, last_word)
This is the same idea behind MapReduce .
This is not a perfect-logic code. I have used C++. If you are very particular with C, you can use POSIX threads instead of std::thread. Also I have just divided the whole file size into number of threads. You will have to take care of last chunk of data(remaining from divide by number of threads), in the last thread itself. I haven't done this.
Another point is the way I am getting the return values from the threads. As of now I am saving it to a global array. C++11 supports retrieving return values - C++: Simple return value from std::thread?
#include <iostream>
#include <fstream>
#include <thread>
#include <mutex>
using namespace std;
#define NO_OF_THREADS 4
int countArray[100];
std::mutex g_pages_mutex;
int trimWhiteSpaces(char *arr, int start, int len)
{
int i = 0;
for(; i < len; i++)
{
char c = arr[i];
if(c == ' ')
{
continue;
}
else
break;
}
return i;
}
void getWordCount(char *arr, int len, int ID)
{
int count = 0;
bool isSpace = false;
int i = 0;
i = i + trimWhiteSpaces(arr, i, len);
for(; i < len; i++)
{
char c = arr[i];
if(c == ' ')
{
i = i + trimWhiteSpaces(&arr[i], i, len) - 1;
//printf("Found space");
isSpace = true;
count++;
}
else
{
isSpace = false;
}
}
if(isSpace)
count = count - 1;
count = count + 1;
g_pages_mutex.lock();
cout << "MYCOUNT:" << count << "\n";
countArray[ID] = count;
g_pages_mutex.unlock();
}
int main(int argc, const char * argv[])
{
char fileData[5000];
std::thread threadIDs[100];
int noOfThreads = NO_OF_THREADS;
char *filePath = "/Users/abc/Desktop/test.txt";
int read_sz = 0;
int decrements = 0;
bool previousNotEndsInSpace = false;
std::ifstream is(filePath, std::ifstream::ate | std::ifstream::binary);
int fileSize = is.tellg();
int bulkSize = fileSize / NO_OF_THREADS;
is.seekg(0);
for(int iter = 0; iter < NO_OF_THREADS; iter++)
{
int old_read_sz = read_sz;
is.read(fileData, bulkSize);
read_sz = is.tellg();
fileData[read_sz - old_read_sz] = '\0';
if(read_sz > 0)
{
cout << " data size so far: " << read_sz << "\n";
cout << fileData << endl;
if(previousNotEndsInSpace && fileData[0] != ' ')
{
decrements = decrements + 1;
}
if(fileData[read_sz - 1] != ' ')
{
previousNotEndsInSpace = true;
}
else
{
previousNotEndsInSpace = false;
}
//getWordCount(fileData, strlen(fileData), iter);
threadIDs[iter] = std::thread(getWordCount, fileData, strlen(fileData), iter);
}
}
for(int iter = 0; iter < NO_OF_THREADS; iter++)
{
threadIDs[iter].join();
}
int totalCount = 0;
for(int iter = 0; iter < NO_OF_THREADS; iter++)
{
cout << "COUNT: " << countArray[iter] << "\n";
totalCount = totalCount + countArray[iter];
}
cout << "TOTAL: " << totalCount - decrements << "\n";
return 0;
}
The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.