
When to build your own buffer system for I/O (C++)?

I have to deal with very large text files (2 GB), and I must read and write them line by line. Writing 23 million lines with ofstream is really slow, so at first I tried to speed things up by collecting large chunks of lines in a memory buffer (for example 256 MB or 512 MB) and then writing the buffer to the file. This did not work; the performance is more or less the same. I have the same problem when reading the files. I know that I/O operations are buffered by the STL I/O system, and that this also depends on the disk scheduler policy (managed by the OS, in my case Linux).

Any ideas on how to improve the performance?

PS: I have been thinking about using a background child process (or a thread) to read/write the data chunks while the program is processing data, but I do not know (mainly in the case of the subprocess) whether this would be worthwhile.

A 2GB file is pretty big, and you need to be aware of all the possible areas that can act as bottlenecks:

  • The HDD itself
  • The HDD interface (IDE/SATA/RAID/USB?)
  • Operating system/filesystem
  • C/C++ library
  • Your code

I'd start by doing some measurements:

  • How long does your code take to read/write a 2GB file?
  • How fast can the 'dd' command read and write to disk? Example:

    dd if=/dev/zero bs=1024 count=2000000 of=file_2GB

  • How long does it take to write/read using just big fwrite()/fread() calls?
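For the last measurement, a rough sketch along these lines could work. The function name `time_chunked_write`, the path and the chunk size are made up for illustration, and the timing includes only what stdio hands to the OS, so the page cache can make the figure look better than the disk really is.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Hypothetical helper: writes total_bytes to `path` using fwrite() calls of
// chunk_bytes each and returns the elapsed wall-clock seconds (-1.0 on error).
double time_chunked_write(const char* path, std::size_t total_bytes,
                          std::size_t chunk_bytes) {
    if (chunk_bytes == 0) return -1.0;
    std::vector<char> chunk(chunk_bytes, 'x');  // dummy payload
    std::FILE* fp = std::fopen(path, "wb");
    if (!fp) return -1.0;

    const auto start = std::chrono::steady_clock::now();
    std::size_t remaining = total_bytes;
    bool ok = true;
    while (ok && remaining > 0) {
        const std::size_t n = remaining < chunk_bytes ? remaining : chunk_bytes;
        ok = std::fwrite(chunk.data(), 1, n, fp) == n;
        remaining -= n;
    }
    // fclose() flushes the library buffer; the OS may still cache the data.
    ok = (std::fclose(fp) == 0) && ok;
    const auto stop = std::chrono::steady_clock::now();

    if (!ok) return -1.0;
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling it with something like `time_chunked_write("file_2GB", 2000000000, 64 * 1024 * 1024)` and comparing the result against the dd figure should show whether the library or the disk is the limit.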

Assuming your disk is capable of reading/writing at about 40 MB/s (which is probably a realistic figure to start from), reading or writing your 2GB file can't take less than about 50 seconds (2000 MB / 40 MB/s).

How long is it actually taking?

Hi Roddy, using the fstream read method with 1.1 GB files and large buffers (128, 255 or 512 MB) it takes about 43-48 seconds, and it is the same using fstream getline (line by line). cp takes almost 2 minutes to copy the file.

In which case, you're hardware-bound. cp has to read and write, and will be seeking back and forth across the disk surface like mad while it does so. So it will (as you see) be more than twice as bad as the simple 'read' case.

To improve the speed, the first thing I'd try is a faster hard drive, or an SSD.

You haven't said what the disk interface is. SATA is pretty much the easiest/fastest option. Also (obvious point, this...) make sure the disk is physically on the same machine your code is running on, otherwise you're network-bound...

I would also suggest memory mapped files, but if you are going to use Boost, I think boost::iostreams::mapped_file is a better match than boost::interprocess.

Maybe you should look into memory mapped files.

Check them in this library: Boost.Interprocess
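To give an idea of what memory mapping looks like without pulling in Boost, here is a minimal sketch that uses the POSIX mmap() call directly, since the question mentions Linux. This is not the Boost API; the function name `count_lines_mmap` is made up, and it only counts newlines as a stand-in for real line processing.

```cpp
#include <cstddef>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: counts '\n' bytes in a file by mapping it read-only.
// Returns -1 on error.
long long count_lines_mmap(const char* path) {
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) != 0) { close(fd); return -1; }
    if (st.st_size == 0) { close(fd); return 0; }  // mmap() rejects length 0

    const std::size_t size = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED) return -1;

    const char* p = static_cast<const char*>(addr);
    const char* const end = p + size;
    long long lines = 0;
    // Jump from newline to newline; the kernel pages the file in on demand.
    while ((p = static_cast<const char*>(
                std::memchr(p, '\n', static_cast<std::size_t>(end - p)))) != nullptr) {
        ++lines;
        ++p;
    }

    munmap(addr, size);
    return lines;
}
```

Whether this beats plain buffered reads depends on the disk; if the measurements show you are hardware-bound, mapping the file will not change that.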

Just a thought, but avoid using std::endl, as this will force a flush before the buffer is full. Use '\n' instead for a newline.
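A small sketch of the difference, assuming a made-up helper called `write_lines`; only the `'\n'` versus `std::endl` choice matters here.

```cpp
#include <cstddef>
#include <fstream>
#include <string>

// Hypothetical helper: writes `count` copies of `line` to `path`.
// Each line ends with '\n', so the stream flushes only when its buffer fills.
bool write_lines(const std::string& path, const std::string& line,
                 std::size_t count) {
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;

    for (std::size_t i = 0; i < count; ++i) {
        out << line << '\n';
        // out << line << std::endl;  // would also flush on every single line
    }

    out.flush();  // one explicit flush at the end is enough
    return static_cast<bool>(out);
}
```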

Don't use new to allocate the buffer like that.

Try std::vector<> instead:

    std::size_t       buffer_size = 64 * 1024 * 1024; // 64 MB for instance.
    std::vector<char> data_buffer(buffer_size);       // Released automatically, unlike new[].
    _file->read(&data_buffer[0], buffer_size);

Also read the article on using underscores in identifier names. Note that your code is OK, though.

Using getline() may be inefficient because the string buffer may need to be re-sized several times as data is appended to it from the stream buffer. You can make this more efficient by pre-sizing the string with reserve().

You can also set the size of the iostreams buffer to something very large, or to NULL for unbuffered access:

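    // Unbuffered access. Set this before open(); the effect of pubsetbuf()
    // is implementation-defined.
    std::fstream file;
    file.rdbuf()->pubsetbuf(nullptr, 0);
    file.open("PLOP");

    // Larger buffer (alternative to the above). The vector must outlive the stream.
    std::vector<char>  buffer(64 * 1024 * 1024);
    std::fstream       file;
    file.rdbuf()->pubsetbuf(&buffer[0], buffer.size());
    file.open("PLOP");

    // Pre-size the string so getline() does not have to grow it repeatedly.
    std::string   line;
    line.reserve(64 * 1024 * 1024);

    while (std::getline(file, line))
    {
        // Do Stuff.
    }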

If you are going to buffer the file yourself, then I'd advise some testing using unbuffered I/O (setvbuf on a file that you've opened with fopen can turn off the library buffering).

Basically, if you are going to buffer yourself, you want to disable the library's buffering, as it's only going to cause you pain. I don't know if there is any way to do that for STL I/O, so I recommend going down to the C-level I/O.
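As a rough illustration of that idea, the sketch below turns off stdio buffering with setvbuf() and keeps its own buffer instead. The class name `LineWriter` and its interface are invented for this example; treat it as a starting point for testing, not a tuned implementation.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical helper: collects lines in its own buffer and hands them to an
// unbuffered FILE* in large fwrite() calls.
class LineWriter {
public:
    LineWriter(const char* path, std::size_t capacity)
        : fp_(std::fopen(path, "wb")), cap_(capacity), failed_(fp_ == nullptr) {
        buf_.reserve(capacity);
        // Must be called before any other operation on the stream.
        if (fp_) std::setvbuf(fp_, nullptr, _IONBF, 0);
    }
    ~LineWriter() { close(); }
    LineWriter(const LineWriter&) = delete;
    LineWriter& operator=(const LineWriter&) = delete;

    bool ok() const { return !failed_; }

    // Appends the line plus '\n'; flushes first if it would overflow the buffer.
    void write_line(const std::string& line) {
        if (buf_.size() + line.size() + 1 > cap_) flush();
        buf_.insert(buf_.end(), line.begin(), line.end());
        buf_.push_back('\n');
    }

    // Writes the whole buffer with a single fwrite() call.
    void flush() {
        if (fp_ && !buf_.empty() &&
            std::fwrite(buf_.data(), 1, buf_.size(), fp_) != buf_.size()) {
            failed_ = true;
        }
        buf_.clear();
    }

    // Flushes and closes the file; returns false if anything failed.
    bool close() {
        flush();
        if (fp_) {
            if (std::fclose(fp_) != 0) failed_ = true;
            fp_ = nullptr;
        }
        return !failed_;
    }

private:
    std::FILE*        fp_;
    std::size_t       cap_;
    bool              failed_;
    std::vector<char> buf_;
};
```

Timing this against the same loop with default stdio buffering should show whether the extra buffer layer is worth keeping.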
