Streaming Boost.Serialization archives

I have a large dataset (100k+ items) that I want to serialize using Boost.Serialization. This works satisfactorily.

Now, when working with even larger datasets, the entire set no longer fits into memory (I currently store a std::map with all data in the archive). Since I need neither random reads nor random writes and only need to access one item at a time, I thought about streaming the dataset by saving instances directly to the archive ( archive << item1 << item2 ... ) and unpacking them one by one.

The other option would be to develop a new file format from scratch (something simple like <length><block>, where each <block> corresponds to one Boost.Serialization archive), because it doesn't seem possible to detect the end of an archive in Boost.Serialization without catching exceptions (I think input_stream_error should be thrown on a read past the end of the archive).

Which option is preferable? Abusing serialization archives for streaming seems odd and hacky but has the big advantage of not reinventing the wheel, while a file format wrapping archives feels cleaner but more error-prone.

Using Boost Serialization for streaming is neither abuse nor odd.

In fact, Boost Serialization offers nothing but the streaming archive interface. So yes, the applicable approach would be to do as you said:

archive << number_of_items;
for(auto it = input_iterator(); it != end(); ++it)
    archive << *it;

In fact, very little stops you from doing the same in your serialize method. You could possibly even make it "automatic" by wrapping your stream in something (like an iterator_range?) and extending Boost Serialization to 'understand' these, the way it 'understands' containers, arrays, etc.

The file format approach is definitely not cleaner (from the library's perspective), since it ruins the archive format isolation. The serialization library has been carefully designed to avoid knowledge of the archive representation, and it would be a breach of abstraction to circumvent this.
