
What is the most efficient way to read millions of integers separated by lines from a text file in C++?

I have about 25 million integers separated by lines in my text file. My first task is to take those integers and sort them. I have actually managed to read the integers and put them into an array (since my sorting function takes an unsorted array as an argument). However, reading the integers from the file is a very long and expensive process. I have searched for many other solutions to find a cheaper, more efficient way of doing this, but I was not able to find one that deals with such sizes. So, what would you suggest for reading the integers from a huge (about 260 MB) text file? And also, how can I get the number of lines efficiently for the same problem?

ifstream myFile("input.txt");

int currentNumber;
int nItems = 25000000;
int *arr = (int*) malloc(nItems*sizeof(*arr));
int i = 0;
while (myFile >> currentNumber)
{
    arr[i++] = currentNumber;
}

This is just how I get the integers from the text file. It is not that complicated. I assumed the number of lines is fixed (actually, it is fixed).

By the way, it is not too slow, of course. It completes the read in approximately 9 seconds on OS X with a 2.2 GHz i7 processor. But I feel it could be much better.

Most likely, any optimisation of this will have rather little effect. On my machine, the limiting factor for reading large files is the disk transfer speed. Yes, improving the read speed can help a little, but most likely you won't gain very much from it.

I found in a previous test [I'll see if I can find the answer with that in it - I couldn't find the source in my "experiment code for SO" directory] that the fastest way is to load the file using mmap. But it's only marginally faster than using ifstream.

Edit: my home-made benchmark for reading a file in a few different ways: getline while reading a file vs reading the whole file and then splitting based on the newline character.

As per usual, benchmarks measure what the benchmark measures, and small changes to either the environment or the way the code is written can sometimes make a big difference.

Edit: Here are a few implementations of "read a number from a file and store it in a vector":

#include <iostream>
#include <fstream>
#include <vector>
#include <sys/time.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>


using namespace std;

const char *file_name = "lots_of_numbers.txt";

void func1()
{
    vector<int> v;
    int num;
    ifstream fin(file_name);
    while( fin >> num )
    {
    v.push_back(num);
    }
    cout << "Number of values read " << v.size() << endl;
}


void func2()
{
    vector<int> v;
    v.reserve(42336000);
    int num;

    ifstream fin(file_name);
    while( fin >> num )
    {
    v.push_back(num);
    }
    cout << "Number of values read " << v.size() << endl;
}

void func3()
{
    int *v = new int[42336000];
    int num;

    ifstream fin(file_name);
    int i = 0;
    while( fin >> num )
    {
    v[i++] = num;
    }
    cout << "Number of values read " << i << endl;
    delete [] v;
}


void func4()
{
    int *v = new int[42336000];
    FILE *f = fopen(file_name, "r");
    int num;
    int i = 0;
    while(fscanf(f, "%d", &num) == 1)
    {
    v[i++] = num;
    }
    cout << "Number of values read " << i << endl;
    fclose(f);
    delete [] v;
}    

void func5()
{
    int *v = new int[42336000];
    int num = 0;

    ifstream fin(file_name);
    char buffer[8192];
    int i = 0;
    int bytes = 0;
    char *p;
    int hasnum = 0;
    int eof = 0;
    while(!eof)
    {
    fin.read(buffer, sizeof(buffer));
    p = buffer;
    bytes = 8192;
    while(bytes > 0)
    {
        if (*p == 26)   // End of file marker...
        {
        eof = 1;
        break;
        }
        if (*p == '\n' || *p == ' ')
        {
        if (hasnum)
            v[i++] = num;
        num = 0;
        p++;
        bytes--;
        hasnum = 0;
        }
        else if (*p >= '0' &&  *p <= '9')
        {
        hasnum = 1;
        num *= 10;
        num += *p-'0';
        p++;
        bytes--;
        }
        else 
        {
        cout << "Error..." << endl;
        exit(1);
        }
    }
    memset(buffer, 26, sizeof(buffer));  // To detect end of files. 
    }
    cout << "Number of values read " << i << endl;
    delete [] v;
}

void func6()
{
    int *v = new int[42336000];
    int num = 0;

    FILE *f = fopen(file_name, "r");
    char buffer[8192];
    int i = 0;
    int bytes = 0;
    char *p;
    int hasnum = 0;
    int eof = 0;
    while(!eof)
    {
    fread(buffer, 1, sizeof(buffer), f);
    p = buffer;
    bytes = 8192;
    while(bytes > 0)
    {
        if (*p == 26)   // End of file marker...
        {
        eof = 1;
        break;
        }
        if (*p == '\n' || *p == ' ')
        {
        if (hasnum)
            v[i++] = num;
        num = 0;
        p++;
        bytes--;
        hasnum = 0;
        }
        else if (*p >= '0' &&  *p <= '9')
        {
        hasnum = 1;
        num *= 10;
        num += *p-'0';
        p++;
        bytes--;
        }
        else 
        {
        cout << "Error..." << endl;
        exit(1);
        }
    }
    memset(buffer, 26, sizeof(buffer));  // To detect end of files. 
    }
    fclose(f);
    cout << "Number of values read " << i << endl;
    delete [] v;
}


void func7()
{
    int *v = new int[42336000];
    int num = 0;

    FILE *f = fopen(file_name, "r");
    int ch;
    int i = 0;
    int hasnum = 0;
    while((ch = fgetc(f)) != EOF)
    {
    if (ch == '\n' || ch == ' ')
    {
        if (hasnum)
        v[i++] = num;
        num = 0;
        hasnum = 0;
    }
    else if (ch >= '0' &&  ch <= '9')
    {
        hasnum = 1;
        num *= 10;
        num += ch-'0';
    }
    else 
    {
        cout << "Error..." << endl;
        exit(1);
    }
    }
    fclose(f);
    cout << "Number of values read " << i << endl;
    delete [] v;
}


void func8()
{
    int *v = new int[42336000];
    int num = 0;

    int f = open(file_name, O_RDONLY);

    off_t size = lseek(f, 0, SEEK_END);
    char *buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);

    int i = 0;
    int hasnum = 0;
    int bytes = size;
    char *p = buffer;
    while(bytes > 0)
    {
    if (*p == '\n' || *p == ' ')
    {
        if (hasnum)
        v[i++] = num;
        num = 0;
        p++;
        bytes--;
        hasnum = 0;
    }
    else if (*p >= '0' &&  *p <= '9')
    {
        hasnum = 1;
        num *= 10;
        num += *p-'0';
        p++;
        bytes--;
    }
    else 
    {
        cout << "Error..." << endl;
        exit(1);
    }
    }
    close(f);
    munmap(buffer, size);
    cout << "Number of values read " << i << endl;
    delete [] v;
}






struct bm
{
    void (*f)();
    const char *name;
};

#define BM(f) { f, #f }

bm b[] = 
{
    BM(func1),
    BM(func2),
    BM(func3),
    BM(func4),
    BM(func5),
    BM(func6),
    BM(func7),
    BM(func8),
};


double time_to_double(timeval *t)
{
    return (t->tv_sec + (t->tv_usec/1000000.0)) * 1000.0;
}

double time_diff(timeval *t1, timeval *t2)
{
    return time_to_double(t2) - time_to_double(t1);
}



int main()
{
    for(int i = 0; i < sizeof(b) / sizeof(b[0]); i++)
    {
    timeval t1, t2;
    gettimeofday(&t1, NULL);
    b[i].f();
    gettimeofday(&t2, NULL);
    cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
    }
    for(int i = sizeof(b) / sizeof(b[0])-1; i >= 0; i--)
    {
    timeval t1, t2;
    gettimeofday(&t1, NULL);
    b[i].f();
    gettimeofday(&t2, NULL);
    cout << b[i].name << ": " << time_diff(&t1, &t2) << "ms" << endl;
    }
}

Results (two consecutive runs, forwards and backwards, to avoid file-caching benefits):

Number of values read 42336000
func1: 6068.53ms
Number of values read 42336000
func2: 6421.47ms
Number of values read 42336000
func3: 5756.63ms
Number of values read 42336000
func4: 6947.56ms
Number of values read 42336000
func5: 941.081ms
Number of values read 42336000
func6: 962.831ms
Number of values read 42336000
func7: 2572.4ms
Number of values read 42336000
func8: 816.59ms
Number of values read 42336000
func8: 815.528ms
Number of values read 42336000
func7: 2578.6ms
Number of values read 42336000
func6: 948.185ms
Number of values read 42336000
func5: 932.139ms
Number of values read 42336000
func4: 6988.8ms
Number of values read 42336000
func3: 5750.03ms
Number of values read 42336000
func2: 6380.36ms
Number of values read 42336000
func1: 6050.45ms

In summary, as someone pointed out in the comments, the actual parsing of the integers is quite a substantial part of the whole time, so reading the file isn't quite as critical as I first made out. Even a very naive way of reading the file (using fgetc()) beats the ifstream operator>> for integers.

As can be seen, using mmap to load the file is slightly faster than reading the file via fstream, but only marginally so.

You can use external sorting to sort the values in your file without loading them all into memory. Sorting speed will be limited by your hard drive's capabilities, but you will be able to handle really huge files. Here is the implementation.
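The linked implementation is not reproduced here, but as a rough sketch of the idea under my own assumptions (a fixed chunk size, temporary run files named run0.txt, run1.txt, and so on), an external merge sort could look something like this: sort chunks that fit in memory, write each sorted chunk to its own run file, then k-way merge the runs.

#include <algorithm>
#include <cstdio>
#include <fstream>
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Sketch of an external merge sort: sort chunks that fit in memory, write each
// sorted chunk to its own run file, then k-way merge the runs with a min-heap.
// The chunk size and the run file names are arbitrary choices for illustration.
void external_sort(const char *input, const char *output,
                   std::size_t chunk_size = 5000000)   // ints per chunk
{
    std::ifstream in(input);
    std::vector<std::string> runs;
    std::vector<int> chunk;
    chunk.reserve(chunk_size);

    auto flush_chunk = [&]() {
        std::sort(chunk.begin(), chunk.end());
        std::string name = "run" + std::to_string(runs.size()) + ".txt";
        std::ofstream run(name);
        for (int x : chunk) run << x << '\n';
        runs.push_back(name);
        chunk.clear();
    };

    int value;
    while (in >> value)
    {
        chunk.push_back(value);
        if (chunk.size() == chunk_size) flush_chunk();
    }
    if (!chunk.empty()) flush_chunk();

    // k-way merge: keep the smallest pending value of each run in a min-heap
    std::vector<std::ifstream> streams;
    for (const std::string &name : runs) streams.emplace_back(name);

    using Item = std::pair<int, std::size_t>;   // (value, index of run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
    for (std::size_t i = 0; i < streams.size(); ++i)
        if (streams[i] >> value) heap.push({value, i});

    std::ofstream out(output);
    while (!heap.empty())
    {
        auto [smallest, idx] = heap.top();
        heap.pop();
        out << smallest << '\n';
        if (streams[idx] >> value) heap.push({value, idx});
    }
    for (const std::string &name : runs) std::remove(name.c_str());
}

With a 260 MB file this is overkill, since everything fits in memory anyway; it only starts to pay off when the data is substantially larger than RAM.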

It will be pretty straightforward with Qt:

QFile file("h:/1.txt");
file.open(QIODevice::ReadOnly);
QDataStream in(&file);

QVector<int> ints;
ints.reserve(25000000);

while (!in.atEnd()) {
    int integer;
    qint8 line; 
    in >> integer >> line; // read an int into integer, a char into line
    ints.append(integer); // append the integer to the vector
}

At the end, you have the ints QVector, which you can easily sort. The number of lines is the same as the size of the vector, provided the file was properly formatted.

On my machine, an i7 3770k @ 4.2 GHz, it takes about 490 milliseconds to read 25 million ints and put them into a vector. That is reading from a regular mechanical HDD, not an SSD.

Buffering the entire file into memory didn't help all that much; the time only dropped to 420 ms.

Try reading chunks of integers and parsing those chunks, instead of reading line by line.

One possible solution would be to divide the large file into smaller chunks, sort each chunk separately, and then merge all the sorted chunks one by one.

EDIT: Apparently this is a well-established method. See 'External merge sort' at http://en.wikipedia.org/wiki/External_sorting

260 MB is not that big. You should be able to load the whole thing into memory and then parse through it. Once it is in, you can use a nested loop to read the integers between line endings and convert them using the usual functions. I'd try to preallocate sufficient memory for your array of integers before you start.

Oh, and you may find the crude old C-style file access functions are the faster option for things like this.
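As an illustration of both points (load everything into memory, then parse with plain C routines), a minimal sketch could look like the following. The function name, the 25-million reserve figure, and the minimal error handling are my own simplifications, not anything from the answer.

#include <cstdio>
#include <cstdlib>
#include <vector>

// Sketch: slurp the whole file into one buffer with C-style I/O, then walk it
// with strtol.
std::vector<int> read_all_ints(const char *path)
{
    std::FILE *f = std::fopen(path, "rb");
    if (!f) return {};

    std::fseek(f, 0, SEEK_END);
    long size = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);

    std::vector<char> buffer(size + 1);
    std::fread(buffer.data(), 1, size, f);
    std::fclose(f);
    buffer[size] = '\0';                  // terminate so strtol stops at the end

    std::vector<int> values;
    values.reserve(25000000);             // roughly the expected count
    char *p = buffer.data();
    char *end = nullptr;
    for (long v = std::strtol(p, &end, 10); p != end;
         v = std::strtol(p, &end, 10))
    {
        values.push_back(static_cast<int>(v));
        p = end;                          // strtol itself skips leading whitespace
    }
    return values;
}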

You don't say how you are reading the values, so it's hard to say. Still, there are really only two solutions: someIStream >> anInt and fscanf( someFd, "%d", &anInt ). Logically, these should have similar performance, but implementations vary; it might be worth trying and measuring both.

Another thing to check is how you're storing them. If you know you have about 25 million, doing a reserve of 30 million on the std::vector before reading would probably help. It might also be cheaper to construct the vector with 30 million elements and then trim it when you've seen the end, rather than using push_back.
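For example, a minimal sketch of the construct-then-trim idea (the 30 million figure is just the headroom guess from the text, and read_with_trim is a made-up name):

#include <fstream>
#include <vector>

// Sketch of "construct big, then trim": allocate more slots than needed,
// fill them directly, and shrink to the number actually read.
std::vector<int> read_with_trim(const char *path)
{
    std::vector<int> v(30000000);          // construct with headroom up front
    std::ifstream fin(path);
    std::size_t count = 0;
    int num;
    while (count < v.size() && fin >> num)
        v[count++] = num;
    v.resize(count);                       // trim once the end has been seen
    return v;
}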

Finally, you might consider writing an mmapstreambuf, using that to mmap the input, and reading it directly from the mapped memory. Or even iterating over it manually, calling strtol (but that's a lot more work); all of the streaming solutions probably end up calling strtol, or something similar, but do significant work around the call first.

EDIT:

FWIW, I did some very quick tests on my home machine (a fairly recent Lenovo, running Linux), and the results surprised me:

  • As a reference, I did the trivial, naïve implementation, using std::cin >> tmp and v.push_back( tmp ), with no attempt to optimize. On my system, this ran in just under 10 seconds.

  • Simple optimizations, such as using reserve on the vector, or initially creating the vector with a size of 25000000, didn't change much; the time was still over 9 seconds.

  • Using a very simple mmapstreambuf, the time dropped to around 3 seconds, with the simplest loop, no reserve, etc.

  • Using fscanf, the time dropped to just under 3 seconds. I suspect that the Linux implementation of FILE* also uses mmap (and std::filebuf doesn't).

  • Finally, using an mmap'ed buffer, iterating with two char*, and using strtol to convert, the time dropped to under a second (a rough sketch of this variant is shown after the next paragraph).

These tests were done very quickly (less than an hour to write and run all of them), and are far from rigorous (and of course, they don't tell you anything about other environments), but the differences surprised me. I didn't expect that much difference.
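A rough sketch of that last variant (mmap plus strtol), assuming a POSIX system; this is a reconstruction of the idea, not the answerer's actual code, and the function name and reserve figure are my own choices.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>
#include <cstdlib>
#include <vector>

// Sketch: mmap the file and parse it in place with strtol (POSIX only).
// Assumes the file ends with a newline, so strtol always hits a separator
// before running off the end of the mapping.
std::vector<int> read_ints_mmap(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};

    struct stat st;
    fstat(fd, &st);
    char *data = static_cast<char *>(
        mmap(nullptr, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0));

    std::vector<int> values;
    values.reserve(25000000);             // roughly the expected count

    char *p = data;
    char *file_end = data + st.st_size;
    while (p < file_end)
    {
        // skip separators manually so we never run past the mapping
        if (*p == '\n' || *p == '\r' || *p == ' ')
        {
            ++p;
            continue;
        }
        char *end = nullptr;
        long v = std::strtol(p, &end, 10);
        if (end == p) break;              // not a digit: stop
        values.push_back(static_cast<int>(v));
        p = end;
    }

    munmap(data, st.st_size);
    close(fd);
    return values;
}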

I would do it this way:

#include <fstream>
#include <iostream>
#include <string>
#include <cstdlib>   // for system("PAUSE")

using namespace std;

int main() {

    fstream file;
    string line;
    int intValue;
    int lineCount = 0;
    try {
        file.open("myFile.txt", ios_base::in); // Open to read
        while(getline(file, line)) {
            lineCount++;
            try {
                intValue = stoi(line);
                // Do something with your value
                cout << "Value for line " << lineCount << " : " << intValue << endl;

            } catch (const exception& e) {
                cerr << "Failed to convert line " << lineCount << " to an int : " << e.what() << endl;
            }
        }
    } catch (const exception& e) {
        cerr << e.what() << endl;
        if (file.is_open()) {
            file.close();
        }
    }

    cout << "Line count : " << lineCount << endl;

    system("PAUSE");
}
