
How to read a huge file in C++

I have a huge file (e.g. 1 TB, or any size that does not fit into RAM) stored on disk. It is space-delimited, and my machine has only 8 GB of RAM. Can I read that file with ifstream? If not, how can I read one block of the file at a time (e.g. 4 GB)?

There are a couple of things that you can do.

First, there's no problem opening a file that is larger than the amount of RAM that you have. What you won't be able to do is load the whole file into memory at once. The best thing would be to find a way to read just a few chunks at a time and process them. You can use ifstream for that purpose (with ifstream::read, for instance). Allocate, say, one megabyte of memory, read the first megabyte of the file into it, process it, and repeat:

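// needs <fstream> and <memory>; names below come from namespace std
ifstream bigFile("mybigfile.dat", ios::binary);
constexpr size_t bufferSize = 1024 * 1024;
unique_ptr<char[]> buffer(new char[bufferSize]);
while (bigFile)
{
    bigFile.read(buffer.get(), bufferSize);
    // gcount() is the number of bytes actually read; the last chunk is usually shorter
    const streamsize bytesRead = bigFile.gcount();
    // process the first bytesRead bytes of buffer
}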
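The file in the question is space-delimited, so a fixed-size chunk will often end in the middle of a token. Below is a minimal sketch of one way to handle that, carrying the unfinished token over to the next chunk. The names for_each_token, chunk_size and on_token are hypothetical, and treating tabs and newlines as delimiters alongside spaces is an assumption, not something the question states.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <functional>
#include <istream>
#include <memory>
#include <string>

// Hypothetical helper: reads `in` in fixed-size chunks and calls `on_token`
// once per whitespace-delimited token, including tokens that span two chunks.
void for_each_token(std::istream& in, std::size_t chunk_size,
                    const std::function<void(const std::string&)>& on_token)
{
    assert(chunk_size > 0);
    std::unique_ptr<char[]> buffer(new char[chunk_size]);
    std::string pending;  // partial token carried over from the previous chunk

    while (in)
    {
        in.read(buffer.get(), static_cast<std::streamsize>(chunk_size));
        const std::streamsize got = in.gcount();  // bytes actually read
        for (std::streamsize i = 0; i < got; ++i)
        {
            const char c = buffer[static_cast<std::size_t>(i)];
            if (c == ' ' || c == '\n' || c == '\t' || c == '\r')
            {
                if (!pending.empty())
                {
                    on_token(pending);
                    pending.clear();
                }
            }
            else
            {
                pending.push_back(c);
            }
        }
    }
    if (!pending.empty())
        on_token(pending);  // the last token has no trailing delimiter
}
```

Memory use stays at roughly one chunk plus the longest single token, regardless of the file size. A caller would open the file with std::ifstream in binary mode and pass a lambda as on_token.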

Another solution is to map the file to memory. Most operating systems will allow you to map a file to memory even if it is larger than the physical amount of memory that you have. This works because the operating system knows that each memory page associated with the file can be mapped and unmapped on demand: when your program needs a specific page, the OS reads it from the file into your process's memory and evicts a page that hasn't been used in a while.

However, this can only work if the file is smaller than the maximum amount of memory that your process can theoretically address. This isn't an issue with a 1 TB file in a 64-bit process, but it wouldn't work in a 32-bit process.

Also be aware of the spirits that you're summoning. Memory-mapping a file is not the same thing as reading from it. If the file is suddenly truncated by another program, your program is likely to crash. If you modify the data and the changes cannot be written back to disk, you may run out of memory. Also, your operating system's paging algorithm may not behave in a way that benefits you significantly. Because of these uncertainties, I would consider mapping the file only if reading it in chunks using the first solution cannot work.

On Linux/OS X, you would use mmap for it. On Windows, you would open the file and then use CreateFileMapping followed by MapViewOfFile.
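As a rough illustration of the mmap route, here is a minimal POSIX-only sketch that maps a file read-only and scans it for spaces. The helper name count_spaces_mmap is hypothetical, and the sketch assumes a 64-bit process so that the file size fits in size_t. It does not guard against the file being truncated while it is mapped.

```cpp
#include <cassert>
#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper (POSIX only): maps the whole file read-only and counts
// space characters. Returns -1 if the file cannot be opened or mapped.
long long count_spaces_mmap(const char* path)
{
    assert(path != nullptr);
    const int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    struct stat st;
    if (fstat(fd, &st) == -1)
    {
        close(fd);
        return -1;
    }
    if (st.st_size == 0)
    {
        close(fd);
        return 0;  // mmap rejects a zero-length mapping
    }

    // Assumes a 64-bit process, so the file size fits in size_t.
    const std::size_t length = static_cast<std::size_t>(st.st_size);
    void* addr = mmap(nullptr, length, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (addr == MAP_FAILED)
        return -1;

    // Pages are loaded from disk on first access, not up front.
    const char* data = static_cast<const char*>(addr);
    long long spaces = 0;
    for (std::size_t i = 0; i < length; ++i)
        if (data[i] == ' ')
            ++spaces;

    munmap(addr, length);
    return spaces;
}
```

On Windows the same shape would use CreateFileMapping and MapViewOfFile instead; that variant is not shown here.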

I am sure you don't have to keep the whole file in memory. Typically you want to read and process the file in chunks. If you want to use ifstream, you can do something like this:

ifstream is("/path/to/file", ios::binary);
char buf[4096];
do {
    is.read(buf, sizeof(buf));
    // gcount() is the number of bytes actually read (can be 0 at end of file)
    process_chunk(buf, is.gcount());
} while(is);
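Since the file in the question is space-delimited, formatted extraction with operator>> may be an even simpler option: the stream buffers the file internally and hands back one token at a time. A minimal sketch, assuming no single token is too large for memory; the helper name count_tokens is hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <string>

// Hypothetical helper: counts whitespace-delimited tokens without ever
// holding more than one token (plus the stream's own buffer) in memory.
// Returns 0 if the file cannot be opened.
std::uint64_t count_tokens(const std::string& path)
{
    assert(!path.empty());
    std::ifstream in(path);
    std::uint64_t count = 0;
    std::string token;
    while (in >> token)  // skips leading whitespace, stops at the next one
        ++count;
    return count;
}
```

This is usually slower than reading raw chunks with read, so it is a trade of speed for simplicity.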

A more advanced approach is to map the file into memory using platform-specific APIs, instead of reading the whole file or its chunks into memory:

Under Windows: CreateFileMapping(), MapViewOfFile()

Under Linux: open(2) / creat(2), shm_open, mmap

You will need to compile a 64-bit app to make it work.

For more details, see: CreateFileMapping, MapViewOfFile, how to avoid holding up the system memory

You can use fread:

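char buffer[size];  // size must be a compile-time constant
// Arguments are (ptr, element size, element count, stream); with element
// size 1 the return value is the number of bytes actually read.
size_t n = fread(buffer, sizeof(char), size, fp);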
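A single fread call only fills the buffer once, so for a large file it goes inside a loop. The following is a minimal sketch of such a loop; the helper name count_bytes_fread and the 4096-byte buffer size are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: reads the file chunk by chunk with fread and returns
// the total number of bytes seen, or -1 on an open or read failure.
long long count_bytes_fread(const char* path)
{
    assert(path != nullptr);
    std::FILE* fp = std::fopen(path, "rb");
    if (!fp)
        return -1;

    char buffer[4096];
    long long total = 0;
    std::size_t n;
    // Element size 1, count sizeof(buffer): the return value is a byte count.
    while ((n = std::fread(buffer, 1, sizeof(buffer), fp)) > 0)
    {
        // process the first n bytes of buffer here
        total += static_cast<long long>(n);
    }

    const bool failed = std::ferror(fp) != 0;  // distinguish error from EOF
    std::fclose(fp);
    return failed ? -1 : total;
}
```

A return of 0 from fread means either end of file or an error, which is why the sketch checks ferror before reporting the total.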

Or, if you want to use C++ fstreams, you can use read as buratino said.

Also keep in mind that you can open a file regardless of its size; the idea is to open it and read it in chunks that fit in your RAM.
