
Improving IO performance for merging two files in C

I wrote a function which merges two large files ( file1, file2 ) into a new file ( outputFile ). Each file is in a line-based format, with entries separated by a \0 byte. Both files contain the same number of NUL bytes.

One example file with two entries could look like this: A\nB\n\0C\nZ\nB\n\0

   Input:
   file1: A\nB\0C\nZ\nB\n\0
   file2: BBA\nAB\0T\nASDF\nQ\n\0
   Output:
   outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0

FILE * outputFile = fopen(...);
setvbuf( outputFile, NULL, _IOFBF, 1024*1024*1024 );
FILE * file1 = fopen(...); 
FILE * file2 = fopen(...); 
int c1, c2;
while((c1=fgetc(file1)) != EOF) {
    if(c1 == '\0'){
        while((c2=fgetc(file2)) != EOF && c2 != '\0') {
            fwrite(&c2, sizeof(char), 1, outputFile);
        }
        char nullByte = '\0';
        fwrite(&nullByte, sizeof(char), 1, outputFile);
    }else{
        fwrite(&c1, sizeof(char), 1, outputFile);
    }
}

Is there a way to improve the IO performance of this function? I increased the buffer size of outputFile to 1 GB by using setvbuf . Would it help to use posix_fadvise on file1 and file2?

You're doing IO character-by-character. That is going to be needlessly and painfully SLOW, even with buffered streams.

Take advantage of the fact that your data is stored in your files as NUL-terminated strings.

Assuming you're alternating NUL-terminated strings from each file, and running on a POSIX platform so you can simply mmap() the input files:

typedef struct mapdata
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

mapdata_t mapFile( const char *filename )
{
    mapdata_t data;
    struct stat sb;

    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );

    data.bytes = sb.st_size;

    /* assumes we have a NUL byte after the file data 
       If the size of the file is an exact multiple of the
       page size, we won't have the terminating NUL byte! */
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return( data );
}

void unmapFile( mapdata_t data )
{
    /* cast away const: munmap() takes a plain void pointer */
    munmap( (void *)data.ptr, data.bytes );
}

void mergeFiles( const char *file1, const char *file2, const char *output )
{
    char zeroByte = '\0';

    mapdata_t data1 = mapFile( file1 );
    mapdata_t data2 = mapFile( file2 );

    size_t strOffset1 = 0UL;
    size_t strOffset2 = 0UL;

    /* get a page-aligned buffer - a 64kB alignment should work */
    char *iobuffer = memalign( 64UL * 1024UL, 1024UL * 1024UL );

    /* memset the buffer to ensure the virtual mappings exist */
    memset( iobuffer, 0, 1024UL * 1024UL );

    /* use of direct IO should reduce memory pressure - the 1 MB
       buffer is already pretty large, and since we're not seeking
       the page cache is really only slowing things down */
    int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );

    FILE *outputfile = fdopen( fd, "wb" );
    setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

    /* loop until we reach the end of either mapped file */
    for ( ;; )
    {
        fputs( data1.ptr + strOffset1, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        fputs( data2.ptr + strOffset2, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        /* skip over the string, assuming there's one NUL
           byte in between strings */
        strOffset1 += 1 + strlen( data1.ptr + strOffset1 );
        strOffset2 += 1 + strlen( data2.ptr + strOffset2 );

        /* if either offset is too big, end the loop */
        if ( ( strOffset1 >= data1.bytes ) ||
             ( strOffset2 >= data2.bytes ) )
        {
            break;
        }
    }

    fclose( outputfile );

    unmapFile( data1 );
    unmapFile( data2 );       
}

I've put in no error checking at all. You'll also need to add the proper header files.
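For reference, a plausible set of headers for the sketch above would be (assuming glibc; O_DIRECT needs _GNU_SOURCE, and memalign lives in malloc.h, with posix_memalign as the portable alternative):

```c
#define _GNU_SOURCE     /* O_DIRECT */
#include <stdio.h>      /* FILE, fdopen, fputs, fwrite, setvbuf, fclose */
#include <string.h>     /* strlen, memset */
#include <malloc.h>     /* memalign (glibc-specific) */
#include <fcntl.h>      /* open, O_RDONLY, O_RDWR, O_TRUNC, O_CREAT, O_DIRECT */
#include <unistd.h>     /* close */
#include <sys/mman.h>   /* mmap, munmap, PROT_READ, MAP_PRIVATE */
#include <sys/stat.h>   /* fstat, struct stat */
```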

Note also that the file data is assumed to NOT be an exact multiple of the system page size, thus ensuring that there's a NUL byte mapped after the file contents. If the size of the file is an exact multiple of the page size, you'll have to mmap() an additional page after the file contents to ensure that there's a NUL byte to terminate the last string.

Or you can rely on there being a NUL byte as the last byte of the file's contents. If that ever turns out to not be true, you'll likely get either a SEGV or corrupted data.
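One way to guarantee the trailing NUL in all cases (a sketch, assuming Linux/POSIX mmap semantics; map_with_nul is a hypothetical helper, not part of the answer above) is to reserve at least one byte more than the file size with a zero-filled anonymous mapping, then overlay the file on the front of it with MAP_FIXED:

```c
#define _GNU_SOURCE
#include <fcntl.h>     /* open, O_RDONLY */
#include <stddef.h>    /* size_t, NULL */
#include <sys/mman.h>  /* mmap, munmap, MAP_FIXED, MAP_ANONYMOUS */
#include <sys/stat.h>  /* fstat, struct stat */
#include <unistd.h>    /* close, sysconf */

/* Map a file so that at least one zero byte follows its contents,
   even when st_size is an exact multiple of the page size.
   Returns the mapping (file length in *len_out) or NULL on error.
   To unmap, recompute the same rounded-up length and munmap() it. */
const char *map_with_nul(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) != 0) { close(fd); return NULL; }

    size_t pagesz = (size_t)sysconf(_SC_PAGESIZE);
    /* round st_size + 1 up to a page boundary: always >= st_size + 1 */
    size_t reserve = ((size_t)sb.st_size / pagesz + 1) * pagesz;

    /* zero-filled anonymous reservation */
    void *base = mmap(NULL, reserve, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { close(fd); return NULL; }

    /* overlay the file data on the front; any bytes of the
       reservation past the file contents stay zeroed */
    if (sb.st_size > 0 &&
        mmap(base, (size_t)sb.st_size, PROT_READ,
             MAP_PRIVATE | MAP_FIXED, fd, 0) == MAP_FAILED) {
        munmap(base, reserve);
        close(fd);
        return NULL;
    }

    close(fd);
    *len_out = (size_t)sb.st_size;
    return (const char *)base;
}
```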

  • you are using two function calls per character (one for input, one for output). Function calls are slow (they pollute the instruction pipeline)
  • fgetc() and fputc() have their getc() / putc() counterparts, which are (can be) implemented as macros, enabling the compiler to inline the entire loop , except for the reading/writing of buffers , twice per 512 or 1024 or 4096 characters processed. (these will invoke system calls, but those are inevitable anyway)
  • using read/write instead of buffered I/O will probably not be worth the effort; the extra bookkeeping will make your loop fatter (btw: using fwrite() to write one character is certainly wasteful, same for write())
  • maybe a larger output buffer could help, but I wouldn't count on that.

A minor improvement would be that if you are going to write individual characters, you should use fputc rather than fwrite .

Furthermore, since you care about speed, you should try out putc and getc rather than fputc and fgetc to see if it runs any faster.
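To make that concrete, here is a sketch of the question's loop rewritten with getc / putc (same logic as the original; merge_streams is a hypothetical wrapper name, and the three streams are assumed already open as in the question):

```c
#include <stdio.h>

/* Merge two NUL-delimited files entry-by-entry using getc/putc,
   which stdio may expand as macros, avoiding a function call per
   character. For each entry of file1 (up to its '\0'), append the
   matching entry of file2, then write a single terminating NUL. */
void merge_streams(FILE *file1, FILE *file2, FILE *out)
{
    int c1, c2;
    while ((c1 = getc(file1)) != EOF) {
        if (c1 == '\0') {
            while ((c2 = getc(file2)) != EOF && c2 != '\0')
                putc(c2, out);
            putc('\0', out);
        } else {
            putc(c1, out);
        }
    }
}
```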

If you can use threads, make one for file1 and another for file2.

Make the outputFile as big as you need, then make thread1 write file1 into outputFile .

Meanwhile thread2 seeks its output position in outputFile to the length of file1 + 1, and writes file2.

Edit:

It's not a correct answer for this case , but to prevent confusion I'll leave it here.

More discussion I found about it: improve performance in file IO in C
