
C programming: How to use mmap(2) to read a file in parallel with multiple threads?

I am trying to write multi-threaded code that reads a file in fixed-size chunks using mmap(2) and counts the words. Each thread works on a separate portion of the file, which should make processing faster. Reading the file with mmap(2) works in a single thread, but with more than one thread the program fails with a segmentation fault.

for( unsigned long cur_pag_num = 0; cur_pag_num < total_blocks; cur_pag_num++ ) {
    mmdata = mmap(
        NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, (fileOffset + (cur_pag_num * PAGE_SIZE))
    );

    if (mmdata == MAP_FAILED) printf(" mmap error ");

    unsigned  long wc = getWordCount( mmdata );
    parserParam->wordCount +=wc;
    munmap( mmdata, PAGE_SIZE );
}

unsigned long getWordCount(char *page){
     unsigned long wordCount=0;
     for(long i = 0 ; page[i] ;i++ ){
        if(page[i]==' ' || page[i]=='\n')
            wordCount++;
     }
     return wordCount;
}

I have figured out that the code fails inside getWordCount(mmdata). What am I doing wrong here?

Note: the file is larger than main memory, so I am reading it in fixed-size chunks (PAGE_SIZE).

getWordCount reads past the end of the mapped page, because its loop stops only when it finds a null byte, and mmap() doesn't add a null byte after the mapped page. You need to pass the size of the mapping to the function. The loop should stop when it reaches either that size or a null byte (if the file isn't long enough to fill the last page, the rest of that page will be zeros).

for( unsigned long cur_pag_num = 0; cur_pag_num < total_blocks; cur_pag_num++ ) {
    mmdata = mmap(
        NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, (fileOffset + (cur_pag_num * PAGE_SIZE))
    );

    if (mmdata == MAP_FAILED) {
        perror("mmap");
        break;  /* don't read from or unmap a failed mapping */
    }

    unsigned  long wc = getWordCount( mmdata, PAGE_SIZE );
    parserParam->wordCount +=wc;
    munmap( mmdata, PAGE_SIZE );
}

unsigned long getWordCount(const char *page, size_t size){
     unsigned long wordCount=0;
     /* Stop at the end of the mapping, or at the zero padding after end-of-file. */
     for(size_t i = 0 ; i < size && page[i] ;i++ ){
        if(page[i]==' ' || page[i]=='\n')
            wordCount++;
     }
     return wordCount;
}
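
If you know the file size (for example from fstat()), you can also clamp the length you pass for the final chunk, so the function never has to rely on the zero padding. A minimal sketch, assuming a hypothetical helper named chunk_len that is not part of the code above:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes of the chunk that starts at `offset`
 * actually lie inside a file of `file_size` bytes. */
size_t chunk_len(uint64_t file_size, uint64_t offset, size_t chunk_size)
{
    assert(chunk_size > 0);

    if (offset >= file_size)
        return 0;                       /* chunk starts at or past end-of-file */

    uint64_t remaining = file_size - offset;
    return remaining < chunk_size ? (size_t)remaining : chunk_size;
}
```

The result could be passed as the second argument of getWordCount instead of PAGE_SIZE. A return value of 0 means the chunk lies entirely past the end of the file and should not be mapped or read at all.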

BTW, there's another problem with your approach: a word that spans page boundaries will be counted twice.
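
One possible way around boundary miscounts is to count word starts rather than separators, and to carry an "inside a word" flag from one chunk to the next. The sketch below is only an illustration: count_words_chunk and the in_word flag are hypothetical names, and it treats any isspace() character as a separator, which is an assumption that goes slightly beyond the space/newline test in the original code.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: counts the words that START in buf[0..len).
 * *in_word says whether the byte just before buf belonged to a word,
 * and is updated so it can be passed on to the next chunk. */
unsigned long count_words_chunk(const char *buf, size_t len, int *in_word)
{
    assert(in_word != NULL);

    unsigned long words = 0;
    for (size_t i = 0; i < len; i++) {
        if (isspace((unsigned char)buf[i])) {
            *in_word = 0;               /* separator ends the current word */
        } else if (!*in_word) {
            *in_word = 1;               /* first byte of a new word */
            words++;
        }
    }
    return words;
}
```

Within one thread, initialise the flag to 0 before the first chunk and reuse it across consecutive chunks. With several threads, each thread would presumably need to set the flag from the byte just before its own region (if there is one) so that a word straddling two regions is counted only once.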

