Reading a huge file into a C++ vector in Ubuntu (Linux OS)

In my C++ program running on Linux (Ubuntu 14.4), I need to read a 90 GB file completely buffered into a C++ vector, and I have only 125 GB of memory.

When I read the file chunk by chunk, the cached memory usage in Linux keeps growing, eventually taking more than 50% of the 128 GB of memory, and the free memory easily drops below 50 GB:

              total        used        free      shared  buff/cache   available
Mem:            125          60           0           0          65          65
Swap:           255           0         255

So I found that the free memory eventually becomes zero, the file-reading process almost stops, and I have to manually run:

echo 3 | sudo tee /proc/sys/vm/drop_caches

to clear the cached memory so that the file-reading process resumes. I understand that cached memory is there to speed up reading a file again. My question is: how can I avoid manually running the drop-caches command, so that the file-reading process completes successfully?

Since you are simply streaming the data and never rereading it, the page cache does you no good whatsoever. In fact, given the amount of data you're pushing through the page cache, and the memory pressure from your application, otherwise useful data is likely evicted from the page cache and your system performance suffers because of that.

So don't use the cache when reading your data. Use direct IO. Per the Linux open() man page:

O_DIRECT (since Linux 2.4.10)

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user-space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT. See NOTES below for further discussion.

...

NOTES

...

O_DIRECT

The O_DIRECT flag may impose alignment restrictions on the length and address of user-space buffers and the file offset of I/Os. In Linux alignment restrictions vary by filesystem and kernel version and might be absent entirely. However there is currently no filesystem-independent interface for an application to discover these restrictions for a given file or filesystem. Some filesystems provide their own interfaces for doing so, for example the XFS_IOC_DIOINFO operation in xfsctl(3).

Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset, must all be multiples of the logical block size of the filesystem. Since Linux 2.6.0, alignment to the logical block size of the underlying storage (typically 512 bytes) suffices. The logical block size can be determined using the ioctl(2) BLKSSZGET operation or from the shell using the command:

  blockdev --getss 

...
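If you'd rather query that from C++ than from the shell, here is a minimal sketch using the BLKSSZGET ioctl the man page mentions. The device path is just an example; substitute the block device backing your filesystem (opening it usually requires root):

#include <fcntl.h>
#include <linux/fs.h>     // BLKSSZGET
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstdio>

int main()
{
    // Example device path; use the device your filesystem lives on.
    int fd = ::open( "/dev/sda", O_RDONLY );
    int logicalBlockSize = 0;

    if ( fd >= 0 && ::ioctl( fd, BLKSSZGET, &logicalBlockSize ) == 0 )
        std::printf( "logical block size: %d bytes\n", logicalBlockSize );

    if ( fd >= 0 )
        ::close( fd );
    return 0;
}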

Since you are not reading the data over and over, direct IO is likely to improve performance somewhat, as the data will go directly from disk into your application's memory instead of from disk, to the page cache, and then into your application's memory.

Use low-level, C-style I/O with open() / read() / close(), and open the file with the O_DIRECT flag:

int fd = ::open( filename, O_RDONLY | O_DIRECT );

This will result in the data being read directly into the application's memory, without being cached in the system's page cache.

You'll have to read() using aligned memory, so you'll need something like this to actually read the data:

char *buffer;
size_t pageSize = sysconf( _SC_PAGESIZE );
size_t bufferSize = 32UL * pageSize;

int rc = ::posix_memalign( ( void ** ) &buffer, pageSize, bufferSize );

posix_memalign() is a POSIX-standard function that returns a pointer to memory aligned as requested. Page-aligned buffers are usually more than sufficient, but aligning to the hugepage size (2 MiB on x86-64) hints to the kernel that you want transparent hugepages for that allocation, making access to your buffer more efficient when you read it later.
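If you want to make that hugepage hint explicit, here is a hypothetical variant of the allocation above, reusing the same names. The 2 MiB constant assumes x86-64 hugepages, and madvise(MADV_HUGEPAGE) is the explicit request; the buffer must span at least one full hugepage to benefit:

#include <sys/mman.h>

// Align to the (assumed) 2 MiB hugepage size and size the buffer as a
// multiple of it; a region smaller than one hugepage gains nothing.
size_t hugePageSize = 2UL * 1024 * 1024;
bufferSize = 16UL * hugePageSize;   // 32 MiB

int rc = ::posix_memalign( ( void ** ) &buffer, hugePageSize, bufferSize );
if ( rc == 0 )
    ::madvise( buffer, bufferSize, MADV_HUGEPAGE );  // ask for transparent hugepages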

ssize_t bytesRead = ::read( fd, buffer, bufferSize );

Without your code, I can't say how to get the data from buffer into your std::vector, but it shouldn't be hard. There are likely ways to wrap the C-style low-level file descriptor with a C++ stream of some type, and to configure that stream to use memory properly aligned for direct IO.
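For illustration only, here is one hedged sketch of how the pieces might fit together: stream the file through the aligned bounce buffer and append each chunk to the vector. The function name and error handling are mine, not the asker's:

#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdlib>
#include <vector>

// A minimal sketch, not a definitive implementation.
std::vector<char> readWholeFile( const char *filename )
{
    int fd = ::open( filename, O_RDONLY | O_DIRECT );
    if ( fd < 0 )
        return {};

    struct stat st;
    ::fstat( fd, &st );

    std::vector<char> contents;
    // Reserve the full file size up front: with a 90 GB file and 125 GB of
    // RAM, letting the vector grow by doubling would transiently need far
    // more memory than the machine has.
    contents.reserve( st.st_size );

    size_t pageSize = ::sysconf( _SC_PAGESIZE );
    size_t bufferSize = 32UL * pageSize;

    char *buffer = nullptr;
    if ( ::posix_memalign( ( void ** ) &buffer, pageSize, bufferSize ) != 0 )
    {
        ::close( fd );
        return {};
    }

    for ( ;; )
    {
        ssize_t bytesRead = ::read( fd, buffer, bufferSize );
        if ( bytesRead <= 0 )   // 0 is EOF, negative is an error (see errno)
            break;
        contents.insert( contents.end(), buffer, buffer + bytesRead );
    }

    ::free( buffer );
    ::close( fd );
    return contents;
}

The reserve() call matters here: without it, each reallocation during growth would briefly hold both the old and the new storage, which you can't afford at this size.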

If you want to see the difference, try this:

echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file of=/dev/null bs=32k

Time that. Then look at the amount of data in the page cache.
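You can see that, for example, in the buff/cache column of free, or straight from the kernel:

grep ^Cached /proc/meminfo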

Then do this:

echo 3 | sudo tee /proc/sys/vm/drop_caches
dd if=/your/big/data/file iflag=direct of=/dev/null bs=32k

Check the amount of data in the page cache after that...

You can experiment with different block sizes to see what works best on your hardware and filesystem.
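For example, a quick shell loop over a few candidate sizes (the file path is the same placeholder as above; dd prints the throughput for each run):

for bs in 64k 256k 1M 4M; do
    echo 3 | sudo tee /proc/sys/vm/drop_caches
    dd if=/your/big/data/file iflag=direct of=/dev/null bs=$bs
done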

Note well, though, that direct IO is very implementation-dependent. The requirements for performing direct IO can vary significantly between filesystems, and performance can vary drastically depending on your IO pattern and your specific hardware. Most of the time it's not worth those dependencies, but the one simple use where it usually is worthwhile is streaming a huge file without rereading/rewriting any part of the data.
