
Efficient Way to Read/Write Vectors to File

Overview: After several days of research, I have been unable to find a fast, efficient way to write a vector to a file and read it back. Most of the answers I have seen write or read each element individually, which becomes incredibly time-consuming as the number of elements increases. I have also been unable to find an answer that addresses my specific problem, so please make sure that your solution works for my specific circumstances (i.e., read the entire question).

My Problem: I have a really large data structure that contains pixel information about images. There are 60,000 images with 784 pixels each, and each one is an image of a handwritten digit. In addition to the 60,000 * 784 pixels, I need to store a label so I know which digit each image represents. The label format, which is required by the rest of the project, is a vector of 10 entries representing the digits 0, 1, 2 ... 9; exactly one entry is '1'/'true' and the rest are '0'/'false'. Because of linear algebra requirements elsewhere in the project, the data must be stored in the 'Col' type from the Armadillo linear algebra library. So the structure that I want to write to and read from a file is declared as std::vector<std::vector<arma::Col<double>>> .

Here is the function that I am using to save the data right now, to give context:

void SaveTrainingData(vector<vector<Col<double>>> trainingData) //format: trainingData[60000][2][784, 10]
{
    ofstream ofile("VectorizedTrainingData.dat", ios::binary);

    for (int i = 0; i < trainingData.size(); i++)
        for (int j = 0; j < trainingData[i].size(); j++)
            for (int k = 0; k < trainingData[i][j].size(); k++)
                ofile.write((char *)&trainingData[i][j][k], sizeof(double));
}

If you have any questions, please do not hesitate to ask! Thanks in advance.

I haven't used Armadillo, but since a Col is a column vector (an Nx1 matrix) and should be stored linearly, you can get rid of the k loop and write out the entire column in one go:

ofile.write((char *)&trainingData[i][j][0], sizeof(double) * trainingData[i][j].size());

If that doesn't work, copy the elements from the Col into a local vector and then write that vector out to the file (the file operation will be much slower than copying some doubles around, so the extra copy costs little).

You probably also want to write out the size of your vector before writing all your elements so you know how many there are to read in.
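As a rough sketch of that size-prefix idea, the code below writes a 64-bit count before every level of the structure and uses those counts to size the containers when reading back. To keep it compilable without Armadillo, `Column` is a `std::vector<double>` standing in for `arma::Col<double>`; the names `Dataset`, `writeCount`, `readCount`, `saveDataset` and `loadDataset` are made up for this example and do not come from the question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Stand-in for arma::Col<double> so the sketch builds without Armadillo.
using Column  = std::vector<double>;
using Dataset = std::vector<std::vector<Column>>;

// Write a 64-bit count so the reader knows how many items follow.
static void writeCount(std::ostream& out, std::uint64_t n)
{
    out.write(reinterpret_cast<const char*>(&n), sizeof n);
}

// Read a 64-bit count; returns 0 if the read fails (caller checks the stream).
static std::uint64_t readCount(std::istream& in)
{
    std::uint64_t n = 0;
    in.read(reinterpret_cast<char*>(&n), sizeof n);
    return n;
}

// Layout: [#images], then per image [#columns],
// then per column [#elements][raw doubles].
bool saveDataset(const std::string& path, const Dataset& data)
{
    assert(!path.empty());
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;

    writeCount(out, data.size());
    for (const auto& image : data) {
        writeCount(out, image.size());
        for (const Column& col : image) {
            writeCount(out, col.size());
            // One write per column instead of one per element.
            out.write(reinterpret_cast<const char*>(col.data()),
                      static_cast<std::streamsize>(col.size() * sizeof(double)));
        }
    }
    return static_cast<bool>(out);
}

bool loadDataset(const std::string& path, Dataset& data)
{
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;

    const std::uint64_t numImages = readCount(in);
    if (!in) return false;
    data.clear();
    data.resize(static_cast<std::size_t>(numImages));

    for (auto& image : data) {
        image.resize(static_cast<std::size_t>(readCount(in)));
        if (!in) return false;
        for (Column& col : image) {
            col.resize(static_cast<std::size_t>(readCount(in)));
            if (!in) return false;
            in.read(reinterpret_cast<char*>(col.data()),
                    static_cast<std::streamsize>(col.size() * sizeof(double)));
        }
    }
    return static_cast<bool>(in);
}
```

The file is only portable between machines with the same endianness and the same `double` representation, and the loader trusts the counts it reads, so a corrupted file could request a huge allocation. For the fixed 784/10 layout in the question, the per-column counts could probably be dropped in favor of a single header.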

I had to look up the documentation for the Armadillo library, but it appears that Col is a contiguous, dense vector class. We can rely on that contiguous representation to eliminate the innermost loop, like so:

// format: trainingData[60000][2][784, 10]
void SaveTrainingData(const vector<vector<Col<double>>>& trainingData) 
{
    ofstream ofile("VectorizedTrainingData.dat", ios::binary);

    const int numImages = trainingData.size();
    for (int i = 0; i < numImages; i++)
    {
        const vector<Col<double>>& img = trainingData[i];
        const int numCols = img.size();
        for (int j = 0; j < numCols; j++)
        {
            const Col<double>& col = img[j];
            // memptr() gives the start of the column's contiguous storage
            ofile.write(reinterpret_cast<const char*>(col.memptr()), col.n_elem * sizeof(double));
        }
    }
}

Calling write once per column instead of once per element reduces the number of calls considerably (by a factor of 784 for the pixel columns), and that alone may already help a bit.

It may be worth measuring this to make sure you're actually I/O bound rather than memory bound. That's a little tricky to reason about because of the potential memory fragmentation involved in all these vectors of vectors of columns.
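One way to do that measurement is to time both write strategies with std::chrono, as in the minimal sketch below. `Column` is again a `std::vector<double>` standing in for `arma::Col<double>`, and `writePerElement`, `writePerColumn` and `elapsedMs` are names invented for this example. The numbers will depend heavily on the disk, the OS cache and the stream's buffering, so treat them as a rough comparison only.

```cpp
#include <cassert>
#include <chrono>
#include <fstream>
#include <string>
#include <vector>

using Column = std::vector<double>; // stand-in for arma::Col<double>

// One write() call per double, like the loop in the question.
void writePerElement(const std::string& path, const std::vector<Column>& cols)
{
    std::ofstream out(path, std::ios::binary);
    assert(out.is_open());
    for (const Column& col : cols)
        for (const double& v : col)
            out.write(reinterpret_cast<const char*>(&v), sizeof v);
}

// One write() call per column.
void writePerColumn(const std::string& path, const std::vector<Column>& cols)
{
    std::ofstream out(path, std::ios::binary);
    assert(out.is_open());
    for (const Column& col : cols)
        out.write(reinterpret_cast<const char*>(col.data()),
                  static_cast<std::streamsize>(col.size() * sizeof(double)));
}

// Returns the wall-clock time fn() takes, in milliseconds.
template <typename Fn>
double elapsedMs(Fn&& fn)
{
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Usage would look something like `double ms = elapsedMs([&] { writePerColumn("out.dat", cols); });`, run several times for each variant. Both functions close the file before returning, so flushing the stream's buffer is included in the timing.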

If the size of the inner vector is always the same (which seems to be the case, with every image being 784 pixels), you might get better results with a contiguous vector<Col<double>>, or with something like this:

struct Image
{
     // fixed-size Armadillo column; 784 matches the pixel count in the question
     Col<double>::fixed<784> pixels;
};
...
vector<Image> trainingData;

... or something like that. I couldn't quite follow how the linear algebra ties in to the image representation, but hopefully this gives an idea.
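To illustrate what a contiguous layout buys, here is a hedged sketch in which each image is a fixed-size record, so the whole dataset sits in one block of memory and can be written or read with a single call. It uses `std::array` rather than Armadillo types so that it compiles on its own; `ImageRecord`, `saveRecords` and `loadRecords` are hypothetical names, and the 784/10 sizes are taken from the question.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <fstream>
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical fixed-size record: 784 pixels plus a 10-entry one-hot label.
struct ImageRecord
{
    std::array<double, 784> pixels;
    std::array<double, 10>  label;
};

// Raw byte I/O like this is only valid for trivially copyable types.
static_assert(std::is_trivially_copyable<ImageRecord>::value,
              "ImageRecord must be trivially copyable");

// Writes every record with a single write() call.
bool saveRecords(const std::string& path, const std::vector<ImageRecord>& records)
{
    assert(!path.empty());
    std::ofstream out(path, std::ios::binary);
    if (!out) return false;
    out.write(reinterpret_cast<const char*>(records.data()),
              static_cast<std::streamsize>(records.size() * sizeof(ImageRecord)));
    return static_cast<bool>(out);
}

// Derives the record count from the file size, then reads everything at once.
bool loadRecords(const std::string& path, std::vector<ImageRecord>& records)
{
    assert(!path.empty());
    std::ifstream in(path, std::ios::binary | std::ios::ate);
    if (!in) return false;

    const std::streamoff end = in.tellg();
    if (end < 0) return false;
    const std::size_t bytes = static_cast<std::size_t>(end);
    if (bytes % sizeof(ImageRecord) != 0) return false; // truncated or wrong file

    records.resize(bytes / sizeof(ImageRecord));
    in.seekg(0);
    in.read(reinterpret_cast<char*>(records.data()),
            static_cast<std::streamsize>(bytes));
    return static_cast<bool>(in);
}
```

With 60,000 records this is a single write of roughly 380 MB (60,000 * 794 * 8 bytes). The rest of the project would still need Col objects for the linear algebra; Armadillo documents constructors that wrap existing memory, which may avoid a copy, but I have not tested that here.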
