Fastest way to read a file into memory in C++?

Question

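I'm trying to read from a file faster. The way I'm currently doing it is shown below, but it is very slow for large files. Is there a faster way to do this? I need to store the values in a struct, which I've defined below.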

std::vector<matEntry> matEntries;
inputfileA.open(matrixAfilename.c_str());

// Read from file to continue setting up sparse matrix A
while (!inputfileA.eof()) {
    // Read row, column, and value into vector
    inputfileA >> (int) row; // row
    inputfileA >> (int) col; // col
    inputfileA >> val;       // value

    // Add row, column, and value entry to the matrix
    matEntries.push_back(matEntry());
    matEntries[index].row = row-1;
    matEntries[index].col = col-1;
    matEntries[index].val = val;

    // Increment index
    index++;
}

My struct:

struct matEntry {
    int row;
    int col;
    float val;
};

The file has the following format (int, int, float):

More information:

I know the number of lines in the file at run time.
I am facing a bottleneck. The profiler says the while() loop is the bottleneck.

Answer 1

To make things easier, I'd define an input stream operator for your struct.

std::istream& operator>>(std::istream& is, matEntry& e)
{
    is >> e.row >> e.col >> e.val;
    e.row -= 1;
    e.col -= 1;

    return is;
}

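As for speed, there is not much to improve without dropping down to very basic file I/O. I think the only thing you can do is size your vector up front so that it doesn't keep reallocating inside the loop. With the input stream operator defined, the code also looks much cleaner: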

std::vector<matEntry> matEntries;
matEntries.resize(numberOfLines);
inputfileA.open(matrixAfilename.c_str());

// Read from file to continue setting up sparse matrix A
while (index < numberOfLines && (inputfileA >> matEntries[index]))
{
    ++index;  // advance only after a successful read
}

Answer 2

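As suggested in the comments, you should profile your code before trying to optimize it. If you just want to try random things until the performance is good enough, you can start by reading the file into memory first. Here's a simple example with some basic profiling added: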

#include <vector>
#include <ctime>
#include <fstream>
#include <sstream>
#include <iostream>

// Assuming something like this...
struct matEntry
{
    int row, col;
    double val;
};

std::istream& operator >> ( std::istream& is, matEntry& e )
{
    // Read one entry and convert the 1-based indices to 0-based
    is >> e.row >> e.col >> e.val;
    e.row -= 1;
    e.col -= 1;
    return is;
}


std::vector<matEntry> ReadMatrices( std::istream& stream )
{
    auto matEntries = std::vector<matEntry>();

    auto e = matEntry();
    // For why this is better than your EOF test, see https://isocpp.org/wiki/faq/input-output#istream-and-while
    while( stream >> e ) {
        matEntries.push_back( e );
    }
    return matEntries;
}

int main()
{
    const auto time0 = std::clock();

    // Read file a piece at a time
    std::ifstream inputFileA( "matFileA.txt" );
    const auto matA = ReadMatrices( inputFileA );

    const auto time1 = std::clock();

    // Read file into memory (from http://stackoverflow.com/a/2602258/201787)
    std::ifstream inputFileB( "matFileB.txt" );
    std::stringstream buffer;
    buffer << inputFileB.rdbuf();
    const auto matB = ReadMatrices( buffer );

    const auto time2 = std::clock();
    // std::clock() returns ticks; divide by CLOCKS_PER_SEC to get seconds
    std::cout << "A: " << (double(time1 - time0) / CLOCKS_PER_SEC) << "  B: " << (double(time2 - time1) / CLOCKS_PER_SEC) << "\n";
    std::cout << matA.size() << " " << matB.size();
}

Beware of reading the same file on disk twice in a row, since the disk cache may hide performance differences.

Other options include:

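Preallocate space in your vector (possibly by adding the entry count to the file format, or by estimating it from the file size or something similar)
Change your file format to binary, or perhaps compress your data, to minimize read time
Memory-map the file
Parallelize (easy: process file A and file B in separate threads [see std::async()]; medium: pipeline it so that the read and the conversion happen on different threads; hard: process the same file in separate threads)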

Other higher-level considerations might include:

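It looks like you have a 4-D array of data (rows/columns of 2D matrices). In many applications, this is a mistake. Take a moment to reconsider whether this data structure is really what you need.
There are many high-quality matrix libraries available (e.g., Boost.QVM, Blaze, etc.). Use them rather than reinventing the wheel.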

Answer 3

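In my experience, the slowest part of this kind of code is parsing the numeric values (especially the floating-point ones). Your code is therefore most likely CPU-bound and can be sped up through parallelization, as follows: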

Assuming that your data is on N lines and that you will process it with k threads, each thread will have to handle about N/k lines.

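mmap() the file.
Scan the entire file for newline characters and determine the range to assign to each thread.
Let each thread process its range in parallel, using an implementation of std::istream that wraps an in-memory buffer.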

Note that this requires making sure that the code populating your data structure is thread-safe.
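The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the names MemoryBuffer, parseRange and parseParallel are hypothetical, the buffer is passed in as a plain pointer and size (it could come from mmap() or from any other in-memory copy of the file), and the ranges are cut by byte offset and then moved forward to the next newline, which only approximates N/k lines per thread. Each task fills its own vector and the results are concatenated afterwards, which sidesteps the thread-safety concern instead of locking a shared structure.

```cpp
#include <algorithm>
#include <cstddef>
#include <future>
#include <istream>
#include <streambuf>
#include <vector>

struct matEntry {
    int row;
    int col;
    float val;
};

std::istream& operator>>(std::istream& is, matEntry& e)
{
    is >> e.row >> e.col >> e.val;
    e.row -= 1;
    e.col -= 1;
    return is;
}

// Read-only stream buffer over [begin, end); the bytes are not copied.
class MemoryBuffer : public std::streambuf {
public:
    MemoryBuffer(const char* begin, const char* end)
    {
        char* b = const_cast<char*>(begin);  // setg needs char*, never written
        setg(b, b, const_cast<char*>(end));
    }
};

// Parses every entry in one range; runs independently on each thread.
std::vector<matEntry> parseRange(const char* begin, const char* end)
{
    MemoryBuffer buf(begin, end);
    std::istream in(&buf);
    std::vector<matEntry> out;
    matEntry e{};
    while (in >> e) {
        out.push_back(e);
    }
    return out;
}

std::vector<matEntry> parseParallel(const char* data, std::size_t size, unsigned k)
{
    if (k == 0) k = 1;
    const char* const end = data + size;

    // Choose k - 1 cut points, each placed just after a newline.
    std::vector<const char*> cuts;
    cuts.push_back(data);
    for (unsigned i = 1; i < k; ++i) {
        const char* guess = std::max(data + size / k * i, cuts.back());
        const char* nl = std::find(guess, end, '\n');
        cuts.push_back(nl == end ? end : nl + 1);
    }
    cuts.push_back(end);

    // One asynchronous task per range.
    std::vector<std::future<std::vector<matEntry>>> parts;
    for (std::size_t i = 0; i + 1 < cuts.size(); ++i) {
        parts.push_back(std::async(std::launch::async, parseRange,
                                   cuts[i], cuts[i + 1]));
    }

    // Concatenate in range order so the file order is preserved.
    std::vector<matEntry> all;
    for (auto& p : parts) {
        std::vector<matEntry> chunk = p.get();
        all.insert(all.end(), chunk.begin(), chunk.end());
    }
    return all;
}
```

On some toolchains this has to be built with thread support enabled (for example -pthread with GCC or Clang). Whether it is actually faster depends on the file size and the number of cores, so it is worth measuring against the single-threaded loop first.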

Answer 1
3 2016-11-18 19:23:51

Answer 2
2 2016-11-18 19:31:37

Answer 3
2 2016-11-18 19:36:26