
How to read/write vector<Chunk*> as memory mapped file(s)?

I have a large set of data chunks (~50GB). In my code I have to be able to do the following things:

  1. Repeatedly iterate over all chunks and do some computations on them.

  2. Repeatedly iterate over all chunks and do some computations on them, where in each iteration the order of visited chunks is (as far as possible) randomized.

So far, I have split the data into 10 binary files (created with boost::serialization) and repeatedly read one after the other and perform the computations. For (2), I read the 10 files in random order and process each one in sequence, which is good enough.

However, reading one of the files (using boost::serialization) takes a long time and I'd like to speed it up.

Can I use memory mapped files instead of boost::serialization?

In particular, I'd have a vector<Chunk*> in each file. I want to be able to read in such a file very, very quickly.

How can I read/write such a vector<Chunk*> data structure? I have looked at boost::interprocess::file_mapping, but I'm not sure how to do it.

I read this ( http://boost.cowic.de/rc/pdf/interprocess.pdf ), but it doesn't say much about memory mapped files. I think I'd store the vector<Chunk*> first in the mapped memory, then store the Chunks themselves. And vector<Chunk*> would actually become offset_ptr<Chunk>*, i.e., an array of offset_ptr?
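The layout described above (a table of pointers first, then the chunks) can be sketched with plain byte offsets. This is only an illustration under assumptions: it uses 64-bit offsets from the start of the buffer instead of boost::interprocess::offset_ptr, it assumes a hypothetical Chunk that is a variable-length array of doubles, and the helper names (pack, chunk_count, chunk_size, chunk_value) are invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical chunk type: a variable-length array of doubles.
using Chunk = std::vector<double>;

// Append the raw bytes of a trivially copyable value to `out`.
template <class T>
void append(std::vector<unsigned char>& out, const T& v) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(&v);
  out.insert(out.end(), p, p + sizeof v);
}

// Read a trivially copyable value at byte offset `off`, without
// assuming that the address is suitably aligned.
template <class T>
T read_at(const unsigned char* base, std::uint64_t off) {
  T v;
  std::memcpy(&v, base + off, sizeof v);
  return v;
}

// Layout: [count][offset_0 .. offset_{count-1}][chunk_0][chunk_1]...
// Each chunk is stored as [n][n doubles]. Offsets are byte offsets
// from the start of the buffer, so they stay valid wherever the
// buffer (or a mapped file with the same bytes) ends up in memory.
std::vector<unsigned char> pack(const std::vector<Chunk>& chunks) {
  std::vector<unsigned char> out;
  const std::uint64_t count = chunks.size();
  append(out, count);
  // The first chunk starts right after the count and the offset table.
  std::uint64_t off = sizeof(std::uint64_t) * (1 + count);
  for (const Chunk& c : chunks) {
    append(out, off);
    off += sizeof(std::uint64_t) + c.size() * sizeof(double);
  }
  for (const Chunk& c : chunks) {
    const std::uint64_t n = c.size();
    append(out, n);
    for (double d : c) append(out, d);
  }
  return out;
}

// Number of chunks stored in the buffer.
std::uint64_t chunk_count(const unsigned char* base) {
  return read_at<std::uint64_t>(base, 0);
}

// Byte offset of chunk `i`, looked up in the offset table.
std::uint64_t chunk_offset(const unsigned char* base, std::uint64_t i) {
  assert(i < chunk_count(base));
  return read_at<std::uint64_t>(base, sizeof(std::uint64_t) * (1 + i));
}

// Number of doubles in chunk `i`.
std::uint64_t chunk_size(const unsigned char* base, std::uint64_t i) {
  return read_at<std::uint64_t>(base, chunk_offset(base, i));
}

// The j-th double of chunk `i`.
double chunk_value(const unsigned char* base, std::uint64_t i, std::uint64_t j) {
  assert(j < chunk_size(base, i));
  const std::uint64_t off = chunk_offset(base, i);
  return read_at<double>(base, off + sizeof(std::uint64_t) + j * sizeof(double));
}
```

The bytes returned by pack could be written to a file once; afterwards the start address of a read-only mapping of that file can be passed as base to the accessor functions, so nothing has to be deserialized up front. Visiting chunks in random order then amounts to shuffling the indices 0 .. count-1. The format is tied to the byte order and sizeof(double) of the machine that wrote it.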

A memory mapped file is just a chunk of memory. Like any other memory, it can be organized as bytes, little-endian words, bits, or any other data structure. If portability is a concern (e.g. endianness), some care is needed.
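To make the endianness remark concrete, here is a minimal sketch of reading and writing a 32-bit value in a fixed (little-endian) byte order, independent of the host. The helper names load_le32 and store_le32 are made up for this illustration and are not part of Boost.

```cpp
#include <cassert>
#include <cstdint>

// Decode a 32-bit unsigned integer stored as four little-endian bytes.
// Gives the same result on little- and big-endian hosts and does not
// require `p` to be aligned.
inline std::uint32_t load_le32(const unsigned char* p) {
  assert(p != nullptr);
  return  static_cast<std::uint32_t>(p[0])
       | (static_cast<std::uint32_t>(p[1]) << 8)
       | (static_cast<std::uint32_t>(p[2]) << 16)
       | (static_cast<std::uint32_t>(p[3]) << 24);
}

// Encode a 32-bit unsigned integer as four little-endian bytes.
inline void store_le32(unsigned char* p, std::uint32_t v) {
  assert(p != nullptr);
  p[0] = static_cast<unsigned char>(v & 0xFF);
  p[1] = static_cast<unsigned char>((v >> 8) & 0xFF);
  p[2] = static_cast<unsigned char>((v >> 16) & 0xFF);
  p[3] = static_cast<unsigned char>((v >> 24) & 0xFF);
}
```

If the files are only ever written and read on the same kind of machine, this extra step can be skipped and the mapped bytes can be reinterpreted directly.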

The following code may be a good starting point:

#include <cassert>
#include <cstddef>   // offsetof
#include <cstdint>
#include <iostream>
#include <boost/iostreams/device/mapped_file.hpp>

struct entry {
  std::uint32_t a;
  std::uint64_t b;
} __attribute__((packed)); /* compiler specific, but supported
                              in other ways by all major compilers */

// Make sure the in-memory layout matches the on-disk layout.
static_assert(sizeof(entry) == 12, "entry: Struct size mismatch");
static_assert(offsetof(entry, a) == 0, "entry: Invalid offset for a");
static_assert(offsetof(entry, b) == 4, "entry: Invalid offset for b");

int main() {
  // Map the file "map" read-only; its bytes are treated as an array of entry.
  boost::iostreams::mapped_file_source mmap("map");
  assert(mmap.is_open());
  const entry* data_begin = reinterpret_cast<const entry*>(mmap.data());
  // Any trailing bytes that do not form a whole entry are ignored.
  const entry* data_end = data_begin + mmap.size()/sizeof(entry);
  for(const entry* ii=data_begin; ii!=data_end; ++ii)
    std::cout << std::hex << ii->a << " " << ii->b << '\n';
  return 0;
}

The data_begin and data_end pointers can be used as iterators with most STL functions.
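As a rough illustration of that last point, the sketch below passes a pair of raw pointers to std::accumulate and std::max_element. It uses a stand-in struct named record (same fields as entry, without the packing attribute) and two hypothetical helpers, sum_b and max_a, so it compiles without Boost or a data file; with a mapped file, data_begin and data_end would be passed instead.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>

// Stand-in for `entry`: same fields, default (unpacked) layout.
struct record {
  std::uint32_t a;
  std::uint64_t b;
};

// Sum of all `b` fields in [first, last).
inline std::uint64_t sum_b(const record* first, const record* last) {
  return std::accumulate(first, last, std::uint64_t{0},
      [](std::uint64_t acc, const record& r) { return acc + r.b; });
}

// Pointer to the element with the largest `a`,
// or `last` if the range is empty.
inline const record* max_a(const record* first, const record* last) {
  return std::max_element(first, last,
      [](const record& x, const record& y) { return x.a < y.a; });
}
```

Because the mapping is read-only, only non-modifying algorithms apply; anything that reorders elements (std::sort, std::shuffle) would need a writable mapping or a separate array of indices.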

