
Space-optimize a large array with many duplicates

Posted 2015-10-19 23:48:14. Tags: arrays, optimization, duplicate-data, memory-optimization

I have an array where the index doubles as an identifier for a collection of items, and the content of the array is a group number. The group numbers fall into a finite range 0..N, where N << length_of_the_array, so every entry will be duplicated a large number of times. Currently I have to use 2 bytes to represent a group number (which can be > 1000 but < 6500), and because of all the duplication this ends up consuming a lot of memory.

Are there ways to space-optimize this array? The complete array can grow to multiple MB in size. I'd appreciate any pointers toward relevant optimization algorithms or techniques. FYI: the programming language I'm using is C++.

1 Answer

Do you still want efficient random access to arbitrary elements? Or are you thinking about space-efficient serialization of the index->group map?

If you still want efficient random access, a single array lookup is not bad. It's at worst a single cache miss. (Well, really, at worst a page fault, or more likely a TLB miss, but that's unlikely if it's only a couple of MB.)

A sorted and run-length encoded list could be binary-searched (by searching an array of prefix sums of the repeat counts), but that only works if you can occasionally sort the list to keep duplicates together.
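To make the prefix-sum idea concrete, here is a minimal sketch. It assumes the duplicates are already contiguous in index order (which is exactly the precondition mentioned above). The name RleGroupMap and its members are made up for illustration and do not come from the question.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container: a run-length encoded index -> group map.
struct RleGroupMap {
    // Prefix sums of the repeat counts: run_end[k] is one past the last
    // index covered by run k.
    std::vector<std::size_t> run_end;
    // Group number stored once per run.
    std::vector<std::uint16_t> group;

    // Build from a plain array; consecutive equal values collapse into one run.
    explicit RleGroupMap(const std::vector<std::uint16_t>& plain) {
        for (std::size_t i = 0; i < plain.size(); ++i) {
            if (!group.empty() && group.back() == plain[i]) {
                run_end.back() = i + 1;      // extend the current run
            } else {
                group.push_back(plain[i]);   // start a new run
                run_end.push_back(i + 1);
            }
        }
    }

    // Number of logical entries (same as the original array length).
    std::size_t size() const { return run_end.empty() ? 0 : run_end.back(); }

    // O(log runs) lookup: binary-search for the first run that ends past index.
    std::uint16_t at(std::size_t index) const {
        assert(index < size());
        auto it = std::upper_bound(run_end.begin(), run_end.end(), index);
        return group[static_cast<std::size_t>(it - run_end.begin())];
    }
};
```

Memory use is proportional to the number of runs rather than the number of entries, so this only pays off when runs are long. A lookup costs a binary search instead of one load, and changing an entry in the middle of a run means splitting that run, which this sketch does not attempt.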

If the duplicates can't be at least somewhat grouped together, there's not much you can do that still allows random access.

Packed 12-bit entries are probably not worth the trouble, unless that was enough to significantly reduce cache misses. A couple of multiply instructions to generate the right address, plus a shift and a mask on the 16-bit load containing the desired value, is not much overhead compared to a cache miss. Write access to packed bitfields is slower, and isn't atomic, so that's a serious downside. Getting a compiler to pack bitfields using structs can be compiler-specific. Maybe just using a char array would be best.
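For reference, packing by hand into a byte array could look roughly like the sketch below. The class name Packed12 and its interface are hypothetical. Note that 12 bits only cover values 0..4095, so this assumes the group numbers actually fit in that range; a wider range would need a wider field.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: stores 12-bit values, two entries per 3 bytes.
class Packed12 {
public:
    explicit Packed12(std::size_t count)
        : count_(count), bytes_((count * 3 + 1) / 2, 0) {}  // ceil(count * 1.5)

    std::uint16_t get(std::size_t i) const {
        assert(i < count_);
        const std::size_t byte = (i * 3) / 2;       // first byte holding entry i
        const unsigned shift = (i & 1) ? 4u : 0u;   // odd entries start mid-byte
        // Assemble the 16-bit window byte by byte (little-endian layout),
        // then shift and mask out the 12 bits we want.
        const unsigned word = bytes_[byte] |
                              (static_cast<unsigned>(bytes_[byte + 1]) << 8);
        return static_cast<std::uint16_t>((word >> shift) & 0x0FFFu);
    }

    // Read-modify-write of two bytes: slower than a plain store and not
    // atomic, since the neighbouring entry shares one of the bytes.
    void set(std::size_t i, std::uint16_t value) {
        assert(i < count_ && value <= 0x0FFFu);
        const std::size_t byte = (i * 3) / 2;
        const unsigned shift = (i & 1) ? 4u : 0u;
        unsigned word = bytes_[byte] |
                        (static_cast<unsigned>(bytes_[byte + 1]) << 8);
        word = (word & ~(0x0FFFu << shift)) |
               (static_cast<unsigned>(value) << shift);
        bytes_[byte] = static_cast<std::uint8_t>(word & 0xFFu);
        bytes_[byte + 1] = static_cast<std::uint8_t>(word >> 8);
    }

    std::size_t byte_size() const { return bytes_.size(); }

private:
    std::size_t count_;
    std::vector<std::uint8_t> bytes_;
};
```

This saves 25% compared to 2 bytes per entry, at the cost of extra work on every access, which is why it only makes sense if the smaller footprint avoids enough cache misses to pay for itself.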