Efficient parsing of mmap file

The following is the code for creating a memory-mapped file using Boost:

boost::iostreams::mapped_file_source file;
boost::iostreams::mapped_file_params param;
param.path = "\\..\\points.pts";  //! File path
param.length = fileSize;          //! Number of bytes to map
file.open(param);
if (file.is_open())
{
    //! Access the buffer and populate the ren point buffer
    const char* pData = file.data();
    char* pData1 = const_cast<char*>(pData);  //! this gives me all the data from the mmap file
    std::vector<RenPoint> readPoints;
    ParseData(pData1, readPoints);
}

The implementation of ParseData is as follows:

void ParseData(char* pbuffer, std::vector<RenPoint>& readPoints)
{
    if (!pbuffer)
        throw std::logic_error("no data in memory-mapped file");

    stringstream strBuffer;
    strBuffer << pbuffer;

    //! Get the max number of points in the pts file
    std::string strMaxPts;
    std::getline(strBuffer, strMaxPts, '\n');
    auto nSize = strMaxPts.size();
    unsigned nMaxNumPts = GetValue<unsigned>(strMaxPts);
    readPoints.clear();

    //! Offset the buffer
    pbuffer += nSize;
    strBuffer << pbuffer;
    std::string cur_line;
    while (std::getline(strBuffer, cur_line, '\n'))
    {
        //! How do I read the data from the mmap file directly and populate my RenPoint structure?
        int yy = 0;
    }

    //! Working but very slow
    /*while (std::getline(strBuffer, strMaxPts, '\n'))
    {
        std::vector<string> fragments;

        istringstream iss(strMaxPts);

        copy(istream_iterator<string>(iss),
            istream_iterator<string>(),
            back_inserter<vector<string>>(fragments));

        //! Logic to populate the structure after getting the data back from fragments
        readPoints.push_back(pt);
    }*/
}

Say I have a minimum of 1 million points in my data structure, and I want to optimize my parsing. Any ideas?

  1. Read in the header information to get the number of points.
  2. Reserve space in a std::vector for N*num_points (N=3 assuming only X, Y, Z; 6 with normals; 9 with normals and RGB).
  3. Load the remainder of the file into a string.
  4. boost::spirit::qi::phrase_parse into the vector.

The code below can parse a file with 40M points (> 1 GB) in about 14 s on my two-year-old MacBook:

#include <boost/spirit/include/qi.hpp>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include <vector>

template <typename Iter>
bool parse_into_vec(Iter p_it, Iter p_end, std::vector<float>& vf) {
    using boost::spirit::qi::phrase_parse;
    using boost::spirit::qi::float_;
    using boost::spirit::qi::ascii::space;

    bool ret = phrase_parse(p_it, p_end, *float_, space, vf);
    return p_it != p_end ? false : ret;
}

int main(int argc, char **args) {
    if(argc < 2) {
        std::cerr << "need a file" << std::endl;
        return -1;
    }
    std::ifstream in(args[1]);

    size_t numPoints;
    in >> numPoints;

    std::istreambuf_iterator<char> eos;
    std::istreambuf_iterator<char> it(in);
    std::string strver(it, eos);

    std::vector<float> vf;
    vf.reserve(3 * numPoints);

    if(!parse_into_vec(strver.begin(), strver.end(), vf)) {
        std::cerr << "failed during parsing" << std::endl;
        return -1;
    }

    return 0;
}

AFAICT, you're currently copying the entire contents of the file into strBuffer.

What I think you want to do is use boost::iostreams::stream with your mapped_file_source instead.

Here's an untested example, based on the linked documentation:

// Create the stream
boost::iostreams::stream<boost::iostreams::mapped_file_source> str("some/path/file");

// Alternatively, you can create the mapped_file_source separately and tell
// the stream to open it (using a copy of your mapped_file_source)
boost::iostreams::stream<boost::iostreams::mapped_file_source> str2;
str2.open(file);

// Now you can use std::getline as you normally would.
std::string strMaxPts;
std::getline(str, strMaxPts);

As an aside, I'll note that by default mapped_file_source maps the entire file, so there's no need to pass the size explicitly.
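For example, a minimal sketch reusing the path from the question:

boost::iostreams::mapped_file_source file;
file.open("\\..\\points.pts");  // length defaults to mapping the whole file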

You can go with something like this (just a quick concept; you'll need to add some additional error checking, etc.):

#include "boost/iostreams/stream.hpp"
#include "boost/iostreams/device/mapped_file.hpp"
#include "boost/filesystem.hpp"
#include "boost/lexical_cast.hpp"

double parse_double(const std::string & str)
{
  double value = 0;
  bool decimal = false;
  double divisor = 1.0;
  for (std::string::const_iterator it = str.begin(); it != str.end(); ++it)
  {
    switch (*it)
    {
    case '.':
    case ',':
      decimal = true;
      break;
    default:
      {
        const int x = *it - '0';
        value = value * 10 + x;
        if (decimal)
          divisor *= 10;
      }
      break;
    }
  }
  return value / divisor;
}


void process_value(const bool initialized, const std::string & str, std::vector< double > & values)
{
  if (!initialized)
  {
    // convert the value count and prepare the output vector
    const size_t count = boost::lexical_cast< size_t >(str);
    values.reserve(count);
  }
  else
  {
    // convert the value
    //const double value = 0; // ~ 0:20 min
    const double value = parse_double(str); // ~ 0:35 min
    //const double value = atof(str.c_str()); // ~ 1:20 min
    //const double value = boost::lexical_cast< double >(str); // ~ 8:00 min ?!?!?
    values.push_back(value);
  }
}


bool load_file(const std::string & name, std::vector< double > & values)
{
  const int granularity = boost::iostreams::mapped_file_source::alignment();
  const boost::uintmax_t chunk_size = ( (256 /* MB */ << 20 ) / granularity ) * granularity;
  boost::iostreams::mapped_file_params in_params(name);
  in_params.offset = 0;
  boost::uintmax_t left = boost::filesystem::file_size(name);
  std::string value;
  bool whitespace = true;
  bool initialized = false;
  while (left > 0)
  {
    in_params.length = static_cast< size_t >(std::min(chunk_size, left));
    boost::iostreams::mapped_file_source in(in_params);
    if (!in.is_open())
      return false;
    const boost::iostreams::mapped_file_source::size_type size = in.size();
    const char * data = in.data();
    for (boost::iostreams::mapped_file_source::size_type i = 0; i < size; ++i, ++data)
    {
      const char c = *data;
      if (strchr(" \t\n\r", c))
      {
        // c is whitespace
        if (!whitespace)
        {
          whitespace = true;
          // finished previous value
          process_value(initialized, value, values);
          initialized = true;
          // start a new value
          value.clear();
        }
      }
      else
      {
        // c is not whitespace
        whitespace = false;
        // append the char to the value
        value += c;
      }
    }
    if (size < chunk_size)
      break;
    in_params.offset += chunk_size;
    left -= chunk_size;
  }
  if (!whitespace)
  {
    // convert the last value
    process_value(initialized, value, values);
  }
  return true;
}
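For reference, a minimal call site for load_file might look like this ("points.pts" is just an example path):

#include <iostream>

int main()
{
  std::vector< double > values;
  if (!load_file("points.pts", values))
  {
    std::cerr << "failed to load file" << std::endl;
    return -1;
  }
  std::cout << "loaded " << values.size() << " values" << std::endl;
  return 0;
}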

Note that your main problem will be the conversion from string to float, which is very slow (insanely slow in the case of boost::lexical_cast). With my custom parse_double it is faster; however, it only accepts one specific format (e.g., you'll need to add sign detection if negative values are allowed, etc.), or you can just go with atof if all possible formats are needed.
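For illustration, here is a sketch of one way to add that sign detection; the digit and decimal handling is unchanged from parse_double above:

double parse_double_signed(const std::string & str)
{
  if (str.empty())
    return 0.0;
  std::string::const_iterator it = str.begin();
  double sign = 1.0;
  if (*it == '-') { sign = -1.0; ++it; }  // leading minus
  else if (*it == '+') { ++it; }          // optional leading plus
  double value = 0;
  bool decimal = false;
  double divisor = 1.0;
  for (; it != str.end(); ++it)
  {
    if (*it == '.' || *it == ',')
      decimal = true;
    else
    {
      value = value * 10 + (*it - '0');
      if (decimal)
        divisor *= 10;
    }
  }
  return sign * value / divisor;
}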

If you want to parse the file faster, you'll probably need to go multithreaded: for example, one thread only splitting out the string values, and one or more other threads converting the loaded string values to floats. In that case you probably won't even need the memory-mapped file, as a regular buffered file read might suffice (the file will be read only once anyway).
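A rough, untested sketch of that split - a single worker and an unbatched queue, so a real version would batch lines to cut locking overhead; "points.pts" is just a placeholder path:

#include <condition_variable>
#include <fstream>
#include <iostream>
#include <mutex>
#include <queue>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

int main()
{
  std::queue< std::string > lines;
  std::mutex m;
  std::condition_variable cv;
  bool done = false;
  std::vector< double > values;

  // consumer: converts queued lines to doubles
  std::thread worker([&] {
    for (;;)
    {
      std::unique_lock< std::mutex > lock(m);
      cv.wait(lock, [&] { return !lines.empty() || done; });
      if (lines.empty() && done)
        return;
      std::string line = std::move(lines.front());
      lines.pop();
      lock.unlock();
      std::istringstream iss(line);
      double v;
      while (iss >> v)
        values.push_back(v);  // only the worker touches 'values' before join()
    }
  });

  // producer: plain buffered reads - the file is read only once anyway
  std::ifstream in("points.pts");
  std::string line;
  while (std::getline(in, line))
  {
    std::lock_guard< std::mutex > lock(m);
    lines.push(std::move(line));
    cv.notify_one();
  }
  {
    std::lock_guard< std::mutex > lock(m);
    done = true;
  }
  cv.notify_one();
  worker.join();
  std::cout << values.size() << " values parsed" << std::endl;
  return 0;
}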

A few quick comments on your code: 1) You're not reserving space for your vector, so it expands every time you add a value. You have already read the number of points from the file, so call reserve(N) after the clear().
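Applied to your ParseData, that would be something like:

readPoints.clear();
readPoints.reserve(nMaxNumPts);  // avoids repeated reallocation while parsing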

2) You're forcing a map of the entire file in one hit, which will work on 64 bits but is probably slow, AND strBuffer << pbuffer; forces another allocation of the same amount of memory.

http://www.boost.org/doc/libs/1_53_0/doc/html/interprocess/sharedmemorybetweenprocesses.html#interprocess.sharedmemorybetweenprocesses.mapped_file.mapped_file_mapping_regions shows how to use getRegion.

Use a loop through getRegion to load an estimated chunk of data containing many lines. You will have to handle partial buffers: each getRegion will likely end with part of a line that you need to preserve and join to the next partial buffer starting the next region.
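An untested sketch of that idea using Boost.Interprocess mapped regions; process_line is a hypothetical per-line hook, and the 64 MB region size is just an example:

#include <boost/interprocess/file_mapping.hpp>
#include <boost/interprocess/mapped_region.hpp>
#include <boost/filesystem.hpp>
#include <boost/cstdint.hpp>
#include <algorithm>
#include <string>

void process_line(const std::string & line);  // hypothetical per-line handler

void load_in_regions(const char * name)
{
  namespace bip = boost::interprocess;
  const boost::uintmax_t file_size = boost::filesystem::file_size(name);
  const std::size_t chunk = 64 << 20;  // 64 MB regions (a multiple of the page size)
  bip::file_mapping mapping(name, bip::read_only);

  std::string carry;  // partial line left over from the previous region
  for (boost::uintmax_t offset = 0; offset < file_size; offset += chunk)
  {
    const std::size_t len =
      static_cast< std::size_t >(std::min< boost::uintmax_t >(chunk, file_size - offset));
    bip::mapped_region region(mapping, bip::read_only, offset, len);
    const char * data = static_cast< const char * >(region.get_address());
    const char * end = data + region.get_size();

    // everything after the last newline belongs to the next region
    const char * last_nl = end;
    while (last_nl != data && *(last_nl - 1) != '\n')
      --last_nl;
    if (last_nl == data)
    {
      carry.append(data, end);  // no newline at all in this region: keep accumulating
      continue;
    }

    std::string text;
    text.swap(carry);            // lines completed by this region
    text.append(data, last_nl);
    carry.assign(last_nl, end);  // preserve the partial last line

    std::size_t pos = 0, nl;
    while ((nl = text.find('\n', pos)) != std::string::npos)
    {
      process_line(text.substr(pos, nl - pos));
      pos = nl + 1;
    }
  }
  if (!carry.empty())
    process_line(carry);  // the final line may lack a trailing newline
}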
