How can I efficiently save a sparse array in a file in C++?

I have an array of doubles with 6 indices, and it is mostly filled with zeros. I don't know yet what type I should use to store it in memory.

But, most importantly: I would like to save it to a file (a binary file?). What is the most efficient way to save it? One requirement is that I can run through all the non-zero entries without passing over the zeros. If I run 6 nested for loops, I'll need too many lifetimes.

Moreover, I don't know how to save it in practice: do I need two files, one acting as an index and the second one containing all the values?

Thanks!

This is probably a solved problem; there are probably sparse-matrix libraries that give you efficient in-memory representations too (e.g. each row is a list of index:value, stored in a std::vector, linked list, hash, or other data structure, depending on whether inserting single non-zero values in the middle is valuable, or on whatever other operation is important).
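As a rough illustration of the "row as a list of index:value" idea, here is a minimal sketch that assumes a sorted std::vector of pairs. The names SparseRow, set, and get are made up for this example and do not come from any particular library; a real sparse-matrix library will likely look different.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sparse row: only non-zero entries are stored, sorted by index.
struct SparseRow {
    using Entry = std::pair<std::uint32_t, double>;
    std::vector<Entry> entries;

    // Insert or overwrite the value at `index`; storing 0.0 erases the entry.
    void set(std::uint32_t index, double value) {
        auto it = std::lower_bound(
            entries.begin(), entries.end(), index,
            [](const Entry& e, std::uint32_t i) { return e.first < i; });
        const bool found = (it != entries.end() && it->first == index);
        if (value == 0.0) {
            if (found) entries.erase(it);
        } else if (found) {
            it->second = value;
        } else {
            entries.emplace(it, index, value);  // O(n) insert in the middle
        }
    }

    // Indices that are not present read as zero.
    double get(std::uint32_t index) const {
        auto it = std::lower_bound(
            entries.begin(), entries.end(), index,
            [](const Entry& e, std::uint32_t i) { return e.first < i; });
        return (it != entries.end() && it->first == index) ? it->second : 0.0;
    }
};
```

Iterating over `entries` visits only the non-zero values. The sorted vector makes lookups and traversal cheap but insertion in the middle linear, which is the trade-off mentioned above; a hash or tree would shift that balance.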


A binary format will be faster to store/load, but whether you go binary or text isn't important for some ways of representing a sparse array. If you write a binary format, endian-agnostic code is a good way to make sure it's portable and doesn't have bugs that only show up on some architectures.

Options:

  • Simple but kind of ugly: gzip / lz4 / lzma the buffer holding your multidimensional array, writing the result to disk. Convert to little-endian on the fly while saving/loading, or store an endianness flag in the format.

  • Same idea but store all 6 indices with each value. This may be good if many innermost arrays have no non-zero values. Every non-zero value has a separate record (a line, in a text-based format). Sample lines (a triple-nested example for readability; it extends to 6 just fine):

dimensions on the first line or something
a b c  val
...
3 2 5   -3.1416

means: matrix[3][2][5] = -3.1416

  • Use a nested sparse-array representation: each row is a list of index:value. Non-present indices are zero. A text format could use spaces and newlines to separate things; a binary format could use a length field at the start of each row or a sentinel value at the end.

    You could flatten the multidimensional array out to one linear index for storage with 32-bit integer indices, or you could represent the nesting somehow. I'm not going to try to make up a text format for this, since it got ugly as I started to think about it.

A regular flat representation of a 6-dimensional array ...

double[10][10][10][10][10][10] = 1 million entries * 8 bytes ~= 8MB

An associative array Index:Value representation, assuming 50% of entries are 0.0 and using a 4-byte (32-bit) index ...

500,000 * 4 bytes + 500,000 * 8 bytes ~= 6MB

A bitmap representation of the sparse array, again assuming 50% of entries are 0.0: one bit is stored per entry, so every byte of the bitmap covers 8 entries in the array. For example, 10000001b means that, of 8 entries, only the first and last are stored; the 6 middle values are omitted since they are zero ...

ceil(1 million / 8) bytes + 500,000 * 8 bytes ~= 4.125MB
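The bitmap layout could be sketched roughly as follows, operating on the flattened array. The struct and function names are invented for this example, and the bit order (most significant bit first, so that 10000001b marks the first and last of 8 entries) is one possible reading of the description above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct BitmapEncoded {
    std::size_t count = 0;             // number of entries in the dense array
    std::vector<std::uint8_t> bitmap;  // one bit per entry, set when non-zero
    std::vector<double> values;        // non-zero values only, in array order
};

inline BitmapEncoded encode(const std::vector<double>& dense) {
    BitmapEncoded enc;
    enc.count = dense.size();
    enc.bitmap.assign((dense.size() + 7) / 8, 0);  // ceil(count / 8) bytes
    for (std::size_t i = 0; i < dense.size(); ++i) {
        if (dense[i] != 0.0) {
            // Most significant bit first: entry 0 maps to 0x80, entry 7 to 0x01.
            enc.bitmap[i / 8] |= static_cast<std::uint8_t>(0x80u >> (i % 8));
            enc.values.push_back(dense[i]);
        }
    }
    return enc;
}

inline std::vector<double> decode(const BitmapEncoded& enc) {
    std::vector<double> dense(enc.count, 0.0);
    std::size_t next = 0;  // position of the next stored non-zero value
    for (std::size_t i = 0; i < enc.count; ++i) {
        if (enc.bitmap[i / 8] & (0x80u >> (i % 8))) {
            dense[i] = enc.values[next++];
        }
    }
    return dense;
}
```

For 1 million entries the bitmap takes 125,000 bytes regardless of how many values are non-zero, which is why this layout wins over 4-byte indices at 50% density. At much lower densities the index:value form becomes smaller, because the bitmap cost stays fixed while the per-value index cost shrinks with the number of stored values.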
