
C++: How to read a large txt file and save it to an array faster

I want to read a large txt file which has more than 50,000 lines.

Sample of the file:

John 12 1 956 02 818 912 322 43 161 9 002 768 23 79 9 1 115 7 2 18 59 58 989 3 56 82 59 147 86 62 06 10 538 36 694 952 71 0 2 5 67 103 6 295 933 428 9 70 708 6 73 449 57 283 6 48 139 5 140 34 5 9 95 74 892 9 387 172 44 05 67 534 7 79 5 565 417 252 480 22 503 089 76 433 93 36 374 97 035 70 2 896 0 3 0 259 93 92 47 860

Description: the sample above shows one line of the txt file. Each word and number is separated by a space.

The goal: I want to save the integer values after the first word (in this case: John) into an integer matrix whose number of rows equals the number of lines in the txt file and whose number of columns is 100.

Here is my code

Mat readInteger(String path_txt_file)
{
    int row = 1;
    int col = 100;
    Mat return_mat;

    Mat tmp_mat(row, col, CV_32F);

    ifstream input(path_txt_file);
    for (std::string line; getline(input, line);)
    {
        int posMat = 0;
        vector<string> v = split<string>(line, " "); // user-defined split helper
        for (size_t i = 1; i < v.size(); i++)        // skip the leading word
        {
            tmp_mat.at<float>(0, posMat++) = atoi(v[i].c_str());
        }
        return_mat.push_back(tmp_mat); // copies the row into return_mat
    }
    return return_mat;
}

Code description

  1. I followed the classical approach of reading the data from a txt file line by line.
  2. I created two Mat objects, return_mat and tmp_mat.
  3. For each line, tmp_mat (row = 1, col = 100) stores that line's integers: we split the string on whitespace, then push the whole tmp_mat to return_mat.

Result: I got the result I want; unfortunately, when the file is too big (and we need that), the process is too slow.

Question

How can we improve this algorithm to deal with large files, 1,000,000 lines for instance? Should we use multithreading?

Thanks

I don't know if you have any say in how the original file is constructed, but you could suggest some changes. I don't think the reading is slow; all the casting is. You first split the line, which is slow, and then you cast each token to an integer and then again to a float. You also use the Mat::at function, and as far as I know that isn't too fast either (I could be wrong on that). Also, pushing back a row to another Mat is the same as doing a copy, which takes time. It's not a lot, but it accumulates over big files.
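To illustrate how much the split step costs, here is a sketch of parsing one such line in place with strtol, with no temporary strings and no OpenCV involved (parseLine is a hypothetical helper, not part of the original code):

```cpp
#include <cassert>
#include <cstdlib>
#include <string>
#include <vector>

// Parse one line of the form "Name n1 n2 ... n100" without splitting:
// skip the first word, then let strtol walk the buffer, advancing the
// cursor past each number it consumes.
std::vector<int> parseLine(const std::string& line)
{
    std::vector<int> values;
    values.reserve(100);
    const char* p = line.c_str();
    while (*p && *p != ' ') ++p;            // skip the leading word
    char* end = nullptr;
    for (long v = std::strtol(p, &end, 10); p != end;
         v = std::strtol(p, &end, 10))
    {
        values.push_back(static_cast<int>(v));
        p = end;                            // continue after the last digit
    }
    return values;
}
```

strtol stops advancing when no digits remain, so the loop terminates at the end of the line without any extra bookkeeping.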

My suggestion is the following:

Create a struct looking like this:

struct Data
{
    char FirstWord[100];
    std::array<int, 100> Values; // a data member cannot share the struct's name
};

Instead of creating a text file, use a binary file and write this struct to it (just look into writing to binary files: http://www.cplusplus.com/reference/ostream/ostream/write/ ).

When you read the file back in, you can do something like this:

ifstream file("File.dat", ios::in | ios::binary);
if (file.is_open())
{
    Data D;
    file.read(reinterpret_cast<char*>(&D), sizeof(D));
    Mat A(RowSize, ColSize, CV_32S, D.Values.data());
}

This way you don't need to do all the casting; you just need one copy.

Hope this helps
