Improving IO performance for merging two files in C
I wrote a function which merges two large files (file1, file2) into a new file (outputFile). Each file is in a line-based format, with entries separated by a \0 byte. Both files contain the same number of NUL bytes.
One example file with two entries could look like this: A\nB\n\0C\nZ\nB\n\0
Input:
file1: A\nB\n\0C\nZ\nB\n\0
file2: BBA\nAB\0T\nASDF\nQ\n\0
Output:
outputFile: A\nB\nBBA\nAB\0C\nZ\nB\nT\nASDF\nQ\n\0
FILE *outputFile = fopen(...);
setvbuf(outputFile, NULL, _IOFBF, 1024 * 1024 * 1024);
FILE *file1 = fopen(...);
FILE *file2 = fopen(...);

int c1, c2;
while ((c1 = fgetc(file1)) != EOF) {
    if (c1 == '\0') {
        /* copy one entry from file2, then write the separator */
        while ((c2 = fgetc(file2)) != EOF && c2 != '\0') {
            fwrite(&c2, sizeof(char), 1, outputFile);
        }
        char nullByte = '\0';
        fwrite(&nullByte, sizeof(char), 1, outputFile);
    } else {
        fwrite(&c1, sizeof(char), 1, outputFile);
    }
}
Is there a way to improve the IO performance of this function? I increased the buffer size of outputFile to 1 GB by using setvbuf. Would it help to use posix_fadvise on file1 and file2?
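For reference, a minimal sketch of the posix_fadvise idea (advise_sequential is a hypothetical helper name; whether it actually helps has to be measured on the target system):

```c
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>

/* Hypothetical helper: tell the kernel the whole stream will be read
   sequentially, so it may enlarge its readahead window.
   Returns 0 on success, an errno value otherwise. */
static int advise_sequential(FILE *fp)
{
    return posix_fadvise(fileno(fp), 0, 0, POSIX_FADV_SEQUENTIAL);
}
```

It would be called once per input stream right after the fopen calls, e.g. `advise_sequential(file1); advise_sequential(file2);`.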
You're doing IO character by character. That is going to be needlessly and painfully SLOW, even with buffered streams.

Take advantage of the fact that your data is stored in your files as NUL-terminated strings.

Assuming you're alternating NUL-terminated strings from each file, and running on a POSIX platform so you can simply mmap() the input files:
typedef struct mapdata
{
    const char *ptr;
    size_t bytes;
} mapdata_t;

mapdata_t mapFile( const char *filename )
{
    mapdata_t data;
    struct stat sb;
    int fd = open( filename, O_RDONLY );
    fstat( fd, &sb );
    data.bytes = sb.st_size;
    /* assumes we have a NUL byte after the file data
       If the size of the file is an exact multiple of the
       page size, we won't have the terminating NUL byte! */
    data.ptr = mmap( NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fd, 0 );
    close( fd );
    return( data );
}

void unmapFile( mapdata_t data )
{
    munmap( ( void * ) data.ptr, data.bytes );
}

void mergeFiles( const char *file1, const char *file2, const char *output )
{
    char zeroByte = '\0';
    mapdata_t data1 = mapFile( file1 );
    mapdata_t data2 = mapFile( file2 );
    size_t strOffset1 = 0UL;
    size_t strOffset2 = 0UL;

    /* get a page-aligned buffer - a 64kB alignment should work */
    char *iobuffer = memalign( 64UL * 1024UL, 1024UL * 1024UL );

    /* memset the buffer to ensure the virtual mappings exist */
    memset( iobuffer, 0, 1024UL * 1024UL );

    /* use of direct IO should reduce memory pressure - the 1 MB
       buffer is already pretty large, and since we're not seeking
       the page cache is really only slowing things down */
    int fd = open( output, O_RDWR | O_TRUNC | O_CREAT | O_DIRECT, 0644 );
    FILE *outputfile = fdopen( fd, "wb" );
    setvbuf( outputfile, iobuffer, _IOFBF, 1024UL * 1024UL );

    /* loop until we reach the end of either mapped file */
    for ( ;; )
    {
        fputs( data1.ptr + strOffset1, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );
        fputs( data2.ptr + strOffset2, outputfile );
        fwrite( &zeroByte, 1, 1, outputfile );

        /* skip over the string, assuming there's one NUL
           byte in between strings */
        strOffset1 += 1 + strlen( data1.ptr + strOffset1 );
        strOffset2 += 1 + strlen( data2.ptr + strOffset2 );

        /* if either offset is too big, end the loop */
        if ( ( strOffset1 >= data1.bytes ) ||
             ( strOffset2 >= data2.bytes ) )
        {
            break;
        }
    }

    fclose( outputfile );
    unmapFile( data1 );
    unmapFile( data2 );
}
I've put in no error checking at all. You'll also need to add the proper header files.
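As a guess at what those headers would be on Linux with glibc (note that O_DIRECT requires _GNU_SOURCE, and memalign is declared in <malloc.h>):

```c
#define _GNU_SOURCE     /* O_DIRECT on Linux */
#include <stdio.h>      /* FILE, fdopen, fputs, fwrite, setvbuf */
#include <string.h>     /* strlen, memset */
#include <fcntl.h>      /* open, O_* flags */
#include <unistd.h>     /* close */
#include <malloc.h>     /* memalign (glibc) */
#include <sys/mman.h>   /* mmap, munmap */
#include <sys/stat.h>   /* fstat */
```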
Note also that the file data is assumed to NOT be an exact multiple of the system page size, thus ensuring that there's a NUL byte mapped after the file contents. If the size of the file is an exact multiple of the page size, you'll have to mmap() an additional page after the file contents to ensure that there's a NUL byte to terminate the last string.
Or you can rely on there being a NUL byte as the last byte of the file's contents. If that ever turns out not to be true, you'll likely get either a SEGV or corrupted data.
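One way to guarantee the trailing NUL regardless of file size, sketched under the assumption of a Linux/POSIX system (map_with_trailing_nul is a hypothetical name): first reserve an anonymous zero-filled region one page larger than the file, then map the file over the front of it with MAP_FIXED, so the trailing anonymous page supplies the zero byte even when the file size is an exact multiple of the page size.

```c
#define _DEFAULT_SOURCE /* MAP_ANONYMOUS */
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: map a file for reading so that a readable NUL
   byte is guaranteed to follow the contents. Returns NULL on failure.
   (A full version would also hand back the reservation size so the
   caller can munmap() the whole region.) */
static const char *map_with_trailing_nul(const char *filename, size_t *out_len)
{
    int fd = open(filename, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat sb;
    if (fstat(fd, &sb) != 0) {
        close(fd);
        return NULL;
    }

    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t len = (size_t)sb.st_size;
    /* round the reservation up so at least one zero byte follows */
    size_t reserve = ((len / page) + 1) * page;

    /* anonymous mappings are zero-filled */
    char *base = mmap(NULL, reserve, PROT_READ,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) {
        close(fd);
        return NULL;
    }

    /* overlay the file contents on the front of the reservation */
    if (mmap(base, len, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0)
            == MAP_FAILED) {
        munmap(base, reserve);
        close(fd);
        return NULL;
    }
    close(fd);
    *out_len = len;
    return base;
}
```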
A minor improvement: if you are going to write individual characters, you should use fputc rather than fwrite.
Furthermore, since you care about speed, you should try out putc and getc rather than fputc and fgetc to see if they run any faster; they may be implemented as macros, which can avoid a function call per byte.
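Sketched against the question's inner copy loop (copy_entry is a hypothetical helper name):

```c
#include <stdio.h>

/* Hypothetical helper: copy one \0-terminated entry from in to out,
   including the terminator, using getc/putc instead of fgetc/fwrite.
   Returns the number of data bytes copied, excluding the \0. */
static long copy_entry(FILE *in, FILE *out)
{
    long n = 0;
    int c;
    while ((c = getc(in)) != EOF && c != '\0') {
        putc(c, out);
        n++;
    }
    putc('\0', out);
    return n;
}
```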
If you can use threads, make one for file1 and another for file2. Make the outputFile as big as you need, then have thread1 write file1 into outputFile, while thread2 seeks to offset length(file1) + 1 in outputFile and writes file2 there.
Edit: This is not a correct answer for this case, but to prevent confusion I'll leave it here. More discussion I found about it: improve performance in file IO in C