
C programming: How to use mmap(2) to read a file in parallel with multiple threads?

I am trying to write multi-threaded code that reads a file in fixed-size chunks using mmap(2) and counts the words. Each thread works on a separate portion of the file, which should make processing faster. Reading the file with mmap(2) works with a single thread. When the number of threads is more than one, it fails with a segmentation fault.

for( unsigned long cur_pag_num = 0; cur_pag_num < total_blocks; cur_pag_num++ ) {
    mmdata = mmap(
        NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, (fileOffset + (cur_pag_num * PAGE_SIZE))
    );

    if (mmdata == MAP_FAILED) printf(" mmap error ");

    unsigned  long wc = getWordCount( mmdata );
    parserParam->wordCount +=wc;
    munmap( mmdata, PAGE_SIZE );
}

unsigned long getWordCount(char *page){
     unsigned long wordCount=0;
     for(long i = 0 ; page[i] ;i++ ){
        if(page[i]==' ' || page[i]=='\n')
            wordCount++;
     }
     return wordCount;
}

I have figured out that the code fails inside getWordCount(mmdata). What am I doing wrong here?

Note: the file is larger than main memory, so I read it in fixed-size chunks (PAGE_SIZE).

getWordCount reads outside the mapped page, because its loop only stops when it finds a null byte, and mmap() doesn't add a null byte after the mapped page. You need to pass the size of the mapped page to the function. The loop should stop when it reaches either that size or a null byte (if the file isn't long enough to fill the last page, the rest of that page will be zeros).

for( unsigned long cur_pag_num = 0; cur_pag_num < total_blocks; cur_pag_num++ ) {
    mmdata = mmap(
        NULL, PAGE_SIZE, PROT_READ, MAP_PRIVATE, fd, (fileOffset + (cur_pag_num * PAGE_SIZE))
    );

    /* Do not touch mmdata after a failed mapping; report the error and stop. */
    if (mmdata == MAP_FAILED) { perror("mmap"); break; }

    unsigned  long wc = getWordCount( mmdata, PAGE_SIZE );
    parserParam->wordCount +=wc;
    munmap( mmdata, PAGE_SIZE );
}

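/* Counts spaces and newlines in at most `size` bytes of the mapped page. */
unsigned long getWordCount(char *page, size_t size){
     unsigned long wordCount=0;
     /* Stop at the end of the mapping, or at the zero padding after EOF. */
     for(size_t i = 0 ; i < size && page[i] ;i++ ){
        if(page[i]==' ' || page[i]=='\n')
            wordCount++;
     }
     return wordCount;
}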

BTW, there's another problem with your approach: a word that spans a page boundary will be counted twice.
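
For reference, here is a minimal self-contained sketch of the bounded approach with several threads. It is not the asker's actual program: the names `struct chunk`, `count_delims`, `worker`, `count_file` and `WINDOW_PAGES` are hypothetical, and the split into page-aligned per-thread regions is an assumption about how the work would be divided. Each thread maps its region one fixed-size window at a time (so a file larger than RAM is never mapped in one piece) and scans exactly the mapped length. Like the code above, it counts spaces and newlines rather than word starts.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define WINDOW_PAGES 256  /* hypothetical window size: pages mapped at once */

/* Hypothetical per-thread work description (not from the original post). */
struct chunk {
    int fd;
    off_t offset;         /* page-aligned start of this thread's region */
    size_t length;        /* bytes in the region, never past end of file */
    size_t window;        /* bytes per mmap() call, a multiple of the page size */
    unsigned long count;  /* delimiters found by this thread */
    int failed;           /* set to 1 if mmap() failed */
};

/* Scan exactly len bytes; no reliance on a terminating null byte. */
static unsigned long count_delims(const char *p, size_t len) {
    unsigned long n = 0;
    for (size_t i = 0; i < len; i++)
        if (p[i] == ' ' || p[i] == '\n') n++;
    return n;
}

static void *worker(void *arg) {
    struct chunk *c = arg;
    for (size_t done = 0; done < c->length; ) {
        size_t n = c->length - done;
        if (n > c->window) n = c->window;
        char *p = mmap(NULL, n, PROT_READ, MAP_PRIVATE, c->fd,
                       c->offset + (off_t)done);
        if (p == MAP_FAILED) { c->failed = 1; return NULL; }
        c->count += count_delims(p, n);
        munmap(p, n);
        done += n;
    }
    return NULL;
}

/* Returns the number of spaces and newlines in the file, or -1 on error. */
long count_file(const char *path, int nthreads) {
    assert(nthreads > 0);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;
    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t size = (size_t)st.st_size;
    size_t pages = (size + page - 1) / page;
    /* Pages per thread, rounded up so that every page is covered. */
    size_t per = (pages + (size_t)nthreads - 1) / (size_t)nthreads;

    pthread_t *tids = calloc((size_t)nthreads, sizeof *tids);
    struct chunk *chunks = calloc((size_t)nthreads, sizeof *chunks);
    if (!tids || !chunks) { free(tids); free(chunks); close(fd); return -1; }

    int started = 0, failed = 0;
    for (; started < nthreads; started++) {
        struct chunk *c = &chunks[started];
        size_t start = (size_t)started * per * page;
        c->fd = fd;
        c->offset = (off_t)start;
        c->window = WINDOW_PAGES * page;
        if (start < size)  /* otherwise length stays 0 and the thread idles */
            c->length = (size - start < per * page) ? size - start : per * page;
        if (pthread_create(&tids[started], NULL, worker, c) != 0) {
            failed = 1;
            break;
        }
    }
    long total = 0;
    for (int t = 0; t < started; t++) {
        pthread_join(tids[t], NULL);
        if (chunks[t].failed) failed = 1;
        total += (long)chunks[t].count;
    }
    free(tids);
    free(chunks);
    close(fd);
    return failed ? -1 : total;
}
```

Build with `-pthread` and call it as, for example, `count_file("big.txt", 4)`. Each thread keeps its own mapping pointer and its own counter, and the totals are only summed after `pthread_join`, so no locking is needed. Every byte is inspected exactly once, so a delimiter is never counted by two threads; a counter that tracks word starts instead would need extra care at chunk boundaries.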
