简体   繁体   中英

What is the best efficient way to read millions of integers separated by lines from text file in c++

I have about 25 millions of integers separated by lines in my text file. My first task is to take those integers and sort them. I have actually achieved to read the integers and put them into an array (since my sorting function takes an unsorted array as an argument). However, this reading the integers from a file is a very long and an expensive process. I have searched many other solutions to get the cheaper and efficient way of doing this but I was not able to find one that tackles with such sizes. Therefore, what would your suggestion be to read the integers from the huge (about 260MB) text file. And also how can I get the number of lines efficiently for the same problem.

ifstream myFile("input.txt");

int currentNumber;
int nItems = 25000000;
int *arr = (int*) malloc(nItems*sizeof(*arr));
int i = 0;
while (myFile >> currentNumber)
{
    arr[i++] = currentNumber;
}

This is just how I get the integers from the text file. It is not that complicated. I assumed the number of lines are fixed (actually it is fixed)

By the way, it is not too slow of course. It completes reading in approximately 9 seconds in OS X with 2.2GHz i7 processor. But I feel it could be much better.

Most likely, any optimisation on this is likely to have rather little effect. On my machine, the limiting factor for reading large files is the disk transfer speed. Yes, improving the read speed can improve it a little bit, but most likely, you won't get very much from that.

I found in a previous test [I'll see if I can find the answer with that in it - I couldn't find the source in my "experiment code for SO" directory] that the fastest way is to load the file using mmap . But it's only marginally faster than using ifstream .

Edit: my home-made benchmark for reading a file in a few different ways. getline while reading a file vs reading whole file and then splitting based on newline character

As per usual, benchmarks measure what the benchmark measures, and small changes to either the environment or the way the code is written can sometimes make a big difference.

Edit: Here are a few implementations of "read a number from a file and store it in a vector":

#include <iostream>
#include <fstream>
#include <vector>
#include <sys/time.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>


using namespace std;

const char *file_name = "lots_of_numbers.txt";

void func1()
{
    vector<int> v;
    int num;
    ifstream fin(file_name);
    while( fin >> num )
    {
    v.push_back(num);
    }
    cout << "Number of values read " << v.size() << endl;
}


void func2()
{
    vector<int> v;
    v.reserve(42336000);
    int num;

    ifstream fin(file_name);
    while( fin >> num )
    {
    v.push_back(num);
    }
    cout << "Number of values read " << v.size() << endl;
}

void func3()
{
    int *v = new int[42336000];
    int num;

    ifstream fin(file_name);
    int i = 0;
    while( fin >> num )
    {
    v[i++] = num;
    }
    cout << "Number of values read " << i << endl;
    delete [] v;
}


void func4()
{
    int *v = new int[42336000];
    FILE *f = fopen(file_name, "r");
    int num;
    int i = 0;
    while(fscanf(f, "%d", &num) == 1)
    {
    v[i++] = num;
    }
    cout << "Number of values read " << i << endl;
    fclose(f);
    delete [] v;
}    

void func5()
{
    int *v = new int[42336000];
    int num = 0;

    ifstream fin(file_name);
    char buffer[8192];
    int i = 0;
    int bytes = 0;
    char *p;
    int hasnum = 0;
    int eof = 0;
    while(!eof)
    {
    fin.read(buffer, sizeof(buffer));
    p = buffer;
    bytes = 8192;
    while(bytes > 0)
    {
        if (*p == 26)   // End of file marker...
        {
        eof = 1;
        break;
        }
        if (*p == '\n' || *p == ' ')
        {
        if (hasnum)
            v[i++] = num;
        num = 0;
        p++;
        bytes--;
        hasnum = 0;
        }
        else if (*p >= '0' &&  *p <= '9')
        {
        hasnum = 1;
        num *= 10;
        num += *p-'0';
        p++;
        bytes--;
        }
        else 
        {
        cout << "Error..." << endl;
        exit(1);
        }
    }
    memset(buffer, 26, sizeof(buffer));  // To detect end of files. 
    }
    cout << "Number of values read " << i << endl;
    delete [] v;
}

void func6()
{
    int *v = new int[42336000];
    int num = 0;

    FILE *f = fopen(file_name, "r");
    char buffer[8192];
    int i = 0;
    int bytes = 0;
    char *p;
    int hasnum = 0;
    int eof = 0;
    while(!eof)
    {
    fread(buffer, 1, sizeof(buffer), f);
    p = buffer;
    bytes = 8192;
    while(bytes > 0)
    {
        if (*p == 26)   // End of file marker...
        {
        eof = 1;
        break;
        }
        if (*p == '\n' || *p == ' ')
        {
        if (hasnum)
            v[i++] = num;
        num = 0;
        p++;
        bytes--;
        hasnum = 0;
        }
        else if (*p >= '0' &&  *p <= '9')
        {
        hasnum = 1;
        num *= 10;
        num += *p-'0';
        p++;
        bytes--;
        }
        else 
        {
        cout << "Error..." << endl;
        exit(1);
        }
    }
    memset(buffer, 26, sizeof(buffer));  // To detect end of files. 
    }
    fclose(f);
    cout << "Number of values read " << i << endl;
    delete [] v;
}


void func7()
{
    int *v = new int[42336000];
    int num = 0;

    FILE *f = fopen(file_name, "r");
    int ch;
    int i = 0;
    int hasnum = 0;
    while((ch = fgetc(f)) != EOF)
    {
    if (ch == '\n' || ch == ' ')
    {
        if (hasnum)
        v[i++] = num;
        num = 0;
        hasnum = 0;
    }
    else if (ch >= '0' &&  ch <= '9')
    {
        hasnum = 1;
        num *= 10;
        num += ch-'0';
    }
    else 
    {
        cout << "Error..." << endl;
        exit(1);
    }
    }
    fclose(f);
    cout << "Number of values read " << i << endl;
    delete [] v;
}


void func8()
{
    int *v = new int[42336000];
    int num = 0;

    int f = open(file_name, O_RDONLY);

    off_t size = lseek(f, 0, SEEK_END);
    char *buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);

    int i = 0;
    int hasnum = 0;
    int bytes = size;
    char *p = buffer;
    while(bytes > 0)
    {
    if (*p == '\n' || *p == ' ')
    {
        if (hasnum)
        v[i++] = num;
        num = 0;
        p++;
        bytes--;
        hasnum = 0;
    }
    else if (*p >= '0' &&  *p <= '9')
    {
        hasnum = 1;
        num *= 10;
        num += *p-'0';
        p++;
        bytes--;
    }
    else 
    {
        cout << "Error..." << endl;
        exit(1);
    }
    }
    close(f);
    munmap(buffer, size);
    cout << "Number of values read " << i << endl;
    delete [] v;
}






struct bm
{
    void (*f)();
    const char *name;
};

#define BM(f) { f, #f }

bm b[] = 
{
    BM(func1),
    BM(func2),
    BM(func3),
    BM(func4),
    BM(func5),
    BM(func6),
    BM(func7),
    BM(func8),
};


double time_to_double(timeval *t)
{
    return (t->tv_sec + (t->tv_usec/1000000.0)) * 1000.0;
}

double time_diff(timeval *t1, timeval *t2)
{
    return time_to_double(t2) - time_to_double(t1);
}



int main()
{
    for(int i = 0; i < sizeof(b) / sizeof(b[0]); i++)
    {
    timeval t1, t2;
    gettimeofday(&t1, NULL);
    b[i].f();
    gettimeofday(&t2, NULL);
    cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
    }
    for(int i = sizeof(b) / sizeof(b[0])-1; i >= 0; i--)
    {
    timeval t1, t2;
    gettimeofday(&t1, NULL);
    b[i].f();
    gettimeofday(&t2, NULL);
    cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
    }
}

Results (two consecutive runs, forwards and backwards to avoid file-caching benefits):

Number of values read 42336000
func1: 6068.53ms
Number of values read 42336000
func2: 6421.47ms
Number of values read 42336000
func3: 5756.63ms
Number of values read 42336000
func4: 6947.56ms
Number of values read 42336000
func5: 941.081ms
Number of values read 42336000
func6: 962.831ms
Number of values read 42336000
func7: 2572.4ms
Number of values read 42336000
func8: 816.59ms
Number of values read 42336000
func8: 815.528ms
Number of values read 42336000
func7: 2578.6ms
Number of values read 42336000
func6: 948.185ms
Number of values read 42336000
func5: 932.139ms
Number of values read 42336000
func4: 6988.8ms
Number of values read 42336000
func3: 5750.03ms
Number of values read 42336000
func2: 6380.36ms
Number of values read 42336000
func1: 6050.45ms

In summary, as someone pointed out in the comments, the actual parsing of integers is quite a substantial part of the whole time, so reading the file isn't quite as critical as I first made out. Even a very naive way of reading the file (using fgetc() beats the ifstream operator>> for integers.

As can be seen, using mmap to load the file is slightly faster than reading the file via fstream , but only marginally so.

You can use external sorting to sort values in your file without loading them all into memory. Sorting speed will be limited by your hard drive capabilities, but you will be able to mess with really huge files. Here is the implementation .

It will be pretty straightforward with Qt:

QFile file("h:/1.txt");
file.open(QIODevice::ReadOnly);
QDataStream in(&file);

QVector<int> ints;
ints.reserve(25000000);

while (!in.atEnd()) {
    int integer;
    qint8 line; 
    in >> integer >> line; // read an int into integer, a char into line
    ints.append(integer); // append the integer to the vector
}

At the end, you have the ints QVector you can easily sort. The number of lines is the same as the size of the vector, provided the file was properly formatted.

On my machine, i7 3770k @4.2 Ghz, it takes about 490 milliseconds to read 25 million ints and put them into a vector. Reading from a regular mechanical HDD, not SSD.

Buffering the entire file into memory didn't help all that much, time dropped to 420 msec.

尝试读取整数块并解析这些块而不是逐行读取。

One possible solution would be dividing the large file into smaller chunks. Sort each chunk separately and then merge all the sorted chunks one by one.

EDIT: Apparently this is a well-established method. See 'External merge sort' at http://en.wikipedia.org/wiki/External_sorting

260MB is not that big. You should be able to load the whole thing into memory and then parse through it. Once in you can use a nested loop to read the integers between line endings and convert using the usual functions. I'd try and preallocate sufficient memory for your array of integers before you start.

Oh, and you may find the crude old C-style file access functions are the faster options for things like this.

You don't say how you are reading the values, so it's hard to say. Still, there are really only two solutions: `someIStream

anInt and fscanf( someFd, "%d", &anInt )` Logically, these should have similar performance, but implementations vary; it might be worth trying and measuring both.

Another thing to check is how you're storing them. If you know you have about 25 million, doing a reserve of 30 million on the std::vector before reading them would probably help. It might also be cheaper to construct the vector with 30 million elements, then trim it when you've seen the end, rather than using push_back .

Finally, you might consider writing a immapstreambuf , and using that to mmap the input, and read it directly from the mapped memory. Or even iterating over it manually, calling strtol (but that's a lot more work); all of the streaming solutions probably end up calling strtol , or something similar, but doing significant work around the call first.

EDIT:

FWIW, I did some very quick tests on my home machine (a fairly recent LeNova, running Linux), and the results surprised me:

  • As a reference, I did the trivial, naïve implementation, using std::cin >> tmp and v.push_back( tmp ); , with no attempts to optimize. On my system, this ran in just under 10 seconds.

  • Simple optimizations, such as using reserve on the vector, or initially creating the vector with a size of 25000000, didn't change much—the time was still over 9 seconds.

  • Using a very simple mmapstreambuf , the time dropped to around 3 seconds—with the simplest loop, no reserve , etc.

  • Using fscanf , the time dropped to just under 3 seconds. I suspect that the Linux implementation of FILE* also uses mmap (and std::filebuf doesn't).

  • Finally, using a mmapbuffer , iterating with two char* , and using stdtol to convert, the time dropped to under a second,

These tests were done very quickly (less than an hour to write and run all of them), and are far from rigorous (and of course, don't tell you anything about other environments), but the differences surprised me. I didn't expect as much difference.

I would do it this way :

#include <fstream>
#include <iostream>
#include <string>

using namespace std;

int main() {

    fstream file;
    string line;
    int intValue;
    int lineCount = 0;
    try {
        file.open("myFile.txt", ios_base::in); // Open to read
        while(getline(file, line)) {
            lineCount++;
            try {
                intValue = stoi(line);
                // Do something with your value
                cout << "Value for line " << lineCount << " : " << intValue << endl;

            } catch (const exception& e) {
                cerr << "Failed to convert line " << lineCount << " to an int : " << e.what() << endl;
            }
        }
    } catch (const exception& e) {
        cerr << e.what() << endl;
        if (file.is_open()) {
            file.close();
        }
    }

    cout << "Line count : " << lineCount << endl;

    system("PAUSE");
}

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM