
C++ reading large files part by part

I've been having a problem that I have not been able to solve yet. The problem is related to reading files; I've looked at threads, including on this site, and they do not seem to solve it. The problem is reading files that are larger than the computer's system memory. When I asked about this a while ago, I was referred to the following code.

string data("");
getline(cin, data);
std::ifstream is(data); //, std::ifstream::binary);
if (is)
{
    // get length of file:
    is.seekg(0, is.end);
    int length = is.tellg();
    is.seekg(0, is.beg);
    // allocate memory:
    char *buffer = new char[length];
    // read data as a block:
    is.read(buffer, length);
    is.close();
    // print content:
    std::cout.write(buffer, length);
    delete[] buffer;
}
system("pause");

This code works well, apart from the fact that it eats memory like a fat kid in a candy store. So after a lot of ghetto, unrefined programming, I was able to figure out a way to sort of fix the problem. However, I more or less traded one problem for another in the process.

#include <iostream>
#include <vector>
#include <string>
#include <fstream>
#include <stdio.h> 
#include <stdlib.h>
#include <iomanip>
#include <windows.h>
#include <cstdlib>
#include <thread>

using namespace std;
/*======================================================*/
    string *fileName = new string("tldr");
    char data[36];
    int filePos(0); // The pos of the file
    int tmSize(0); // The total size of the file    

    int split(32);
    char buff;
    int DNum(0);
/*======================================================*/



int getFileSize(std::string filename) // path to file
{
    FILE *p_file = NULL;
    p_file = fopen(filename.c_str(),"rb");
    fseek(p_file,0,SEEK_END);
    int size = ftell(p_file);
    fclose(p_file);
    return size;
}

void fs()
{
    tmSize = getFileSize(*fileName);
    int AX(0);
    ifstream fileIn;
    fileIn.open(*fileName, ios::in | ios::binary);
    int n1,n2,n3;
    n1 = tmSize / 32;

    // Does the processing
    while(filePos != tmSize)
    {
        fileIn.seekg(filePos,ios_base::beg);
        buff = fileIn.get();
        // To take into account small files
        if(tmSize < 32)
        {
            int Count(0);
            char MT[40];
            if(Count != tmSize)
            {
                MT[Count] = buff;
                cout << MT[Count];// << endl;
                Count++;
            }
        }
        // Anything larger than 32
        else
        {
            if(AX != split)
            {
                data[AX] = buff;
                AX++;
                if(AX == split)
                {

                    AX = 0;
                }
            }

        }
        filePos++;
    }
    int tz(0);
    filePos = filePos - 12;

    while(tz != 2)
    {
        fileIn.seekg(filePos,ios_base::beg);
        buff = fileIn.get();
        data[tz] = buff;
        tz++;
        filePos++;
    }

    fileIn.close();
}

int main()
{
    fs();
    cout << tmSize << endl;
    system("pause");
}

What I tried to do with this code is work around the memory issue. Rather than allocating enough memory for a large file, which my system simply does not have, I tried to make do with the memory I have, which is about 8 GB, and ideally use only a few kilobytes of it. To give you an idea of what I am talking about, take a line of text: "Hello my name is cake please give me cake". Basically, what I did was read that piece of text letter by letter. Then I put those letters into a box that could store 32 of them; from there I could apply something like XOR and write them out to another file.

The idea works, in a way, but it is horribly slow and leaves off parts of files. So basically, how can I make something like this work without being slow or cutting off parts of files? I would also love to see how XOR works with very large files. If anyone has a better idea than what I have, I would be very grateful for the help.

To read and process the file piece-by-piece, you can use the following snippet:

// Buffer size 1 Megabyte (or any number you like)
size_t buffer_size = 1<<20;
char *buffer = new char[buffer_size];

std::ifstream fin("input.dat", std::ios::binary);  // binary mode: no newline translation

while (fin)
{
    // Try to read next chunk of data
    fin.read(buffer, buffer_size);
    // Get the number of bytes actually read
    size_t count = fin.gcount();
    // If nothing has been read, break
    if (!count) 
        break;
    // Do whatever you need with first count bytes in the buffer
    // ...
}

delete[] buffer;

The buffer size of 32 bytes that you are using is definitely too small. You make far too many calls to library functions (and the library, in turn, makes calls to the OS, although probably not on every read), and those calls are typically slow because they cause context switches. For example, reading a 1 GB file 32 bytes at a time takes roughly 33 million read calls, versus about a thousand calls with a 1 MB buffer. There is also no need for tell/seek.

If you don't need all of the file content at once, reduce the working set first, for example to a set of about 32 words; but since XOR can be applied sequentially, you can simplify further and use a working set of constant size, such as 4 kilobytes.
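
For illustration, here is a minimal sketch of that idea (the file names "input.dat"/"output.dat" and the single-byte key 0x5A are placeholders, not anything from the question): it reads the input in 4 KB chunks, XORs every byte, and writes the result out, so memory use stays at a few kilobytes no matter how large the file is.

#include <cstddef>
#include <fstream>
#include <vector>

int main()
{
    const std::size_t chunk_size = 4096;      // 4 KB working set
    const char key = 0x5A;                    // placeholder XOR key
    std::vector<char> buffer(chunk_size);

    std::ifstream in("input.dat", std::ios::binary);
    std::ofstream out("output.dat", std::ios::binary);

    while (in)
    {
        // Read up to one chunk; gcount() reports how many bytes were actually read
        in.read(buffer.data(), buffer.size());
        std::streamsize count = in.gcount();
        if (count == 0)
            break;
        // XOR each byte of the chunk in place
        for (std::streamsize i = 0; i < count; ++i)
            buffer[i] ^= key;
        // Write only the bytes that were read
        out.write(buffer.data(), count);
    }
    return 0;
}

Using std::vector for the buffer also removes the need for a manual new/delete pair.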

Now you have two options: call is.read() in a loop and process a small chunk of data on each iteration, or memory-map the file (for example with mmap() on POSIX, or CreateFileMapping/MapViewOfFile on Windows) to get a pointer through which you can perform both read and write operations.
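
If you go the memory-mapping route, a rough sketch for a POSIX system (Linux/macOS) might look like the following; the file name and XOR key are again placeholders.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main()
{
    int fd = open("input.dat", O_RDWR);       // placeholder file name
    if (fd < 0)
        return 1;

    struct stat st;
    if (fstat(fd, &st) != 0)
        return 1;
    size_t size = static_cast<size_t>(st.st_size);

    // Map the whole file; the OS pages it in and out on demand,
    // so the entire file is never held in RAM at once.
    void *p = mmap(nullptr, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        return 1;
    char *data = static_cast<char*>(p);

    for (size_t i = 0; i < size; ++i)
        data[i] ^= 0x5A;                      // XOR in place

    munmap(p, size);
    close(fd);
    return 0;
}

With MAP_SHARED and PROT_WRITE the XORed bytes are written back to the file itself; use MAP_PRIVATE (or copy to a separate output file) if the original must stay untouched.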
