
C++: How to read large txt file and save to array faster

I want to read a large txt file which has more than 50000 lines.

Sample of the file:

John 12 1 956 02 818 912 322 43 161 9 002 768 23 79 9 1 115 7 2 18 59 58 989 3 56 82 59 147 86 62 06 10 538 36 694 952 71 0 2 5 67 103 6 295 933 428 9 70 708 6 73 449 57 283 6 48 139 5 140 34 5 9 95 74 892 9 387 172 44 05 67 534 7 79 5 565 417 252 480 22 503 089 76 433 93 36 374 97 035 70 2 896 0 3 0 259 93 92 47 860

Description: the above sample shows the format of each line in the txt file. Each token is separated by a space.

The goal: I want to take the integer values after the first word (in this case: John) and save them into an integer matrix whose number of rows = number of lines in the txt file and number of columns = 100.

Here is my code:

#include <opencv2/core.hpp>
#include <cstdlib>
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

using namespace cv;
using namespace std;

Mat readInteger(const string& path_txt_file)
{
    const int row = 1;
    const int col = 100;
    Mat return_mat;

    Mat tmp_mat(row, col, CV_32F);

    ifstream input(path_txt_file);
    for (string line; getline(input, line);)
    {
        // Split the line on whitespace.
        istringstream iss(line);
        vector<string> v;
        for (string token; iss >> token;)
            v.push_back(token);

        // Skip v[0] (the name) and store the integers as floats.
        int posMat = -1;
        for (size_t i = 1; i < v.size(); i++)
        {
            posMat = posMat + 1;
            tmp_mat.at<float>(0, posMat) = atoi(v[i].c_str());
        }
        return_mat.push_back(tmp_mat); // copies the row into return_mat
    }
    tmp_mat.release();
    return return_mat;
}

Code description

  1. I followed the classical way of reading data from a txt file: read it line by line.
  2. I created two Mat objects, return_mat and tmp_mat.
  3. For each line, tmp_mat (row = 1, col = 100) holds that line's integers: we split the string on whitespace, fill tmp_mat, and then push the whole tmp_mat to return_mat.

Result: I got the result I want; unfortunately, when the file is too big (and we need it to be), the process is too slow.

Question

How can we improve this algorithm to deal with a large file, 1000000 lines for instance? I wonder if we should use multithreading?

Thanks

I don't know if you have any say in how the original file is constructed, but you could suggest some changes. I don't think the reading itself is slow; all the conversions are. You first split the line, which is slow, and then you convert each token to an integer and then again to a float. You also use the Mat::at function, and as far as I know that isn't too fast either (I could be wrong on that). Also, pushing a row back to another Mat is the same as doing a copy, which takes time. None of this is much on its own, but it accumulates over a big file.
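As an illustration of cutting those per-line costs while still reading text, here is a minimal sketch (my own, not part of the original answer): it parses with strtol instead of split() + atoi(), writes through a raw row pointer instead of at<float>, and preallocates the whole Mat so there is no per-line push_back copy. It assumes every line is one name followed by exactly 100 integers, and that the line count numLines is known up front (both the assumption and the name readIntegerFast are hypothetical):

#include <opencv2/core.hpp>
#include <cstdlib>
#include <fstream>
#include <string>

using namespace cv;
using namespace std;

Mat readIntegerFast(const string& path, int numLines)
{
    Mat out(numLines, 100, CV_32F);      // preallocated: no per-line copy
    ifstream input(path);
    int r = 0;
    for (string line; r < numLines && getline(input, line); ++r)
    {
        const char* p = line.c_str();
        while (*p && *p != ' ') ++p;     // skip the leading name
        char* end = nullptr;
        float* row = out.ptr<float>(r);  // raw row pointer, no at<> lookups
        for (int c = 0; c < 100; ++c)
        {
            // strtol skips leading whitespace and reports where it stopped.
            row[c] = static_cast<float>(strtol(p, &end, 10));
            p = end;
        }
    }
    return out;
}

If the line count isn't known, counting newlines in a first pass (or growing the Mat in large chunks) keeps it to a single allocation.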

My suggestion is the following:

Create a struct looking like this:

#include <array>

struct Data
{
    char FirstWord[100];          // the array declarator goes after the name
    std::array<int, 100> Values;  // a member cannot share its class's name
};

Instead of creating a text file, use a binary file and write this struct to it (just look into writing to binary files: http://www.cplusplus.com/reference/ostream/ostream/write/ ).
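A minimal sketch of what the writing side could look like, assuming the Data struct above (File.dat and the fill-in step are placeholders; the struct is repeated so the sketch compiles on its own):

#include <array>
#include <fstream>

struct Data
{
    char FirstWord[100];
    std::array<int, 100> Values;
};

int main()
{
    Data D = {};  // zero-initialize the name and the values
    // ... fill D.FirstWord and D.Values here ...

    std::ofstream file("File.dat", std::ios::out | std::ios::binary);
    // Dump the raw bytes of the struct: one write per record, no text formatting.
    file.write(reinterpret_cast<const char*>(&D), sizeof(D));
}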

When you read the file back in, you can do something like this:

ifstream file ("File.dat", ios::in|ios::binary);
if (file.is_open())
{
    Data D;
    file.read(reinterpet_cast<char*>(&D), sizeof(D));
    Mat A(RowSize,ColSize,D.data());
}

In this way you don't need to do all the converting; you just need one copy.
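To read the whole file rather than a single record, the same read can sit in a loop; a rough sketch, again assuming the Data struct above:

#include <array>
#include <fstream>
#include <string>
#include <opencv2/core.hpp>

struct Data // same layout as above
{
    char FirstWord[100];
    std::array<int, 100> Values;
};

cv::Mat readAll(const std::string& path)
{
    cv::Mat result;
    std::ifstream file(path, std::ios::in | std::ios::binary);
    Data D;
    while (file.read(reinterpret_cast<char*>(&D), sizeof(D)))
    {
        // Wrap the record's int buffer without copying, then push one row.
        cv::Mat row(1, 100, CV_32S, D.Values.data());
        result.push_back(row);
    }
    return result;
}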

Hope this helps
