
Fastest way to search in sorted static array

I am looking for the fastest way to search in a sorted, fixed array of 32-bit keys. The array size and data are static and will never change. The size of this array is ~1000-10000 unique elements. The search range is significantly broader (~100000), so a lot of searched values will not be found. I am interested in exact matches only.

Here is how the search proceeds:

  1. Generate ~100 keys. These keys are in order of relevance, so they cannot simply be sorted.
  2. Search for the set of ~100 keys in a collection of static arrays (typically between 50 and 300 of them).
  3. Stop the search when we have found enough matching results (hence the importance of not sorting the keys, so that the most relevant results come first). A minimal sketch of this loop follows the list.
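
Since the question targets C++, here is a minimal sketch of that driver loop, assuming per-key early termination and a plain binary search as the per-array lookup. The function name find_matches, the container types and the max_results cut-off are illustrative choices, not part of the original question.

#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical driver: keys are already ordered by relevance, so we walk them
// in that order and stop as soon as enough matches have been collected.
std::vector<std::uint32_t> find_matches(const std::vector<std::uint32_t>& keys_by_relevance,
                                        const std::vector<std::vector<std::uint32_t>>& sorted_arrays,
                                        std::size_t max_results)
{
    std::vector<std::uint32_t> matches;
    for (std::uint32_t key : keys_by_relevance) {
        for (const auto& arr : sorted_arrays) {
            // Baseline per-array lookup: binary search on the sorted static array.
            if (std::binary_search(arr.begin(), arr.end(), key)) {
                matches.push_back(key);
                break;  // assumption: one hit per key is enough
            }
        }
        if (matches.size() >= max_results) {
            break;  // early exit: the most relevant keys were tried first
        }
    }
    return matches;
}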

A potentially interesting property of the keys is that even if they are not close in terms of integer value, most of them will differ by only a few bits (~1-4) from their closest neighbor.

Most answers I found point towards binary search, but none deal with the case of a static array, which probably opens up some optimization possibilities.

I have full control over the data structure; right now it is a fixed, sorted array, but I could change that if it's not optimal. Since the data doesn't change, I could also add precomputed information, as long as it doesn't take an unreasonable amount of memory.

The goal is to be efficient both in CPU and memory, although CPU is the priority here.

Using C++, although that probably won't affect the answer much.

Considering that your static arrays never change, and that you have infinite pre-processing power, I think the best approach would be to create a specific hash function for each of your arrays.

My approach - define a parameterized hash function (code in Java):

import java.util.concurrent.ThreadLocalRandom;
import java.util.function.Function;

private static Function<Long, Integer> createHashFunction(int sz) {
    // Random shift widths; every call yields a different candidate hash function.
    int mvLeft = ThreadLocalRandom.current().nextInt(30);
    int mvRight = ThreadLocalRandom.current().nextInt(16);
    int mvLeft2 = ThreadLocalRandom.current().nextInt(10);
    int mvRight2 = ThreadLocalRandom.current().nextInt(16);
    int mvLeft3 = ThreadLocalRandom.current().nextInt(16);
    int mvRight3 = ThreadLocalRandom.current().nextInt(20);
    return (boxedKey) -> {
        long key = boxedKey; // work on a primitive long to avoid repeated boxing
        // These operations are totally random, and have no mathematical background beneath them!
        key = ~key + (key << mvLeft);
        key = key ^ (key >>> mvRight);
        key = key + (key << mvLeft2);
        key = key ^ (key >>> mvRight2);
        key = key + (key << mvLeft3);
        key = key ^ (key >>> mvRight3);
        // floorMod keeps the index non-negative (Math.abs(Long.MIN_VALUE) would not);
        // sz is the size of the target array.
        return (int) Math.floorMod(key, (long) sz);
    };
}

For each array, find the combination of parameters for which the maximum bucket size is smallest.
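
Since the question is about C++, here is a rough C++ sketch of that parameter search: a re-implementation of the random mixing hash above plus a brute-force retry loop that keeps the candidate with the smallest worst-case bucket. The names (RandomMixHash, max_bucket_size, find_hash_with_small_buckets), the fixed RNG seed and the attempt cap are assumptions made for illustration.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical C++ version of the answer's randomly parameterized mixing hash.
struct RandomMixHash {
    int l1, r1, l2, r2, l3, r3;

    std::uint32_t operator()(std::uint64_t key, std::uint32_t table_size) const {
        key = ~key + (key << l1);
        key ^= key >> r1;
        key += key << l2;
        key ^= key >> r2;
        key += key << l3;
        key ^= key >> r3;
        return static_cast<std::uint32_t>(key % table_size);
    }
};

// Largest number of keys that fall into any single slot for this candidate hash.
std::size_t max_bucket_size(const std::vector<std::uint32_t>& keys,
                            const RandomMixHash& h, std::uint32_t table_size) {
    std::vector<std::size_t> counts(table_size, 0);
    std::size_t worst = 0;
    for (std::uint32_t k : keys) {
        std::size_t c = ++counts[h(k, table_size)];
        if (c > worst) worst = c;
    }
    return worst;
}

// Keep drawing random parameters until the worst bucket holds at most two keys
// (or the attempt budget runs out, in which case the best candidate found is returned).
RandomMixHash find_hash_with_small_buckets(const std::vector<std::uint32_t>& keys,
                                           std::uint32_t table_size,
                                           int attempts = 100000) {
    std::mt19937 rng(12345);  // fixed seed: reproducible preprocessing
    auto draw = [&rng](int bound) {
        return std::uniform_int_distribution<int>(1, bound)(rng);
    };
    RandomMixHash best{1, 1, 1, 1, 1, 1};
    std::size_t best_worst = keys.size() + 1;
    for (int i = 0; i < attempts; ++i) {
        RandomMixHash h{draw(30), draw(16), draw(10), draw(16), draw(16), draw(20)};
        std::size_t worst = max_bucket_size(keys, h, table_size);
        if (worst < best_worst) {
            best = h;
            best_worst = worst;
        }
        if (best_worst <= 2) break;  // every slot now fits in one 64-bit word
    }
    return best;
}

Each candidate costs one O(n) pass over the array; in the tests quoted below the table size was around 262k for 10k-element inputs.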

Some testing (input array has the size of 10k, filled with random elements):

  • Hash mapping into [0..262k] results in a bucket of 2 items max. 5k random arrays were tested; the single-threaded version finds hash functions at a rate of ~100 arrays/second.

Considering that with a max bucket size of 2 it is possible to pack both values into one 64-bit integer, this approach results in only one memory jump and the simplest operations for the CPU: hashing is done through xor, plus and shifts, which should be extremely fast, as is the final bit comparison.
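
To make the single memory jump concrete, here is a hypothetical C++ sketch of that packed layout: each table slot is one 64-bit word holding up to two 32-bit keys, with 0xFFFFFFFF reserved as an "empty" marker. The sentinel, the build_table/contains names and the template parameter are my assumptions, not part of the original answer.

#include <cstdint>
#include <vector>

// Reserved value marking an unused half-slot; assumes it never occurs as a real key.
constexpr std::uint32_t EMPTY = 0xFFFFFFFFu;

// Builds the packed table for one static array, given a hash (e.g. the one found
// above) whose worst bucket holds at most two keys.
template <typename Hash>
std::vector<std::uint64_t> build_table(const std::vector<std::uint32_t>& keys,
                                       const Hash& hash, std::uint32_t table_size) {
    std::vector<std::uint64_t> table(table_size,
                                     (std::uint64_t(EMPTY) << 32) | EMPTY);
    for (std::uint32_t k : keys) {
        std::uint64_t& slot = table[hash(k, table_size)];
        if (static_cast<std::uint32_t>(slot) == EMPTY)
            slot = (slot & 0xFFFFFFFF00000000ull) | k;   // first key goes in the low half
        else
            slot = (std::uint64_t(k) << 32)              // second key goes in the high half
                   | static_cast<std::uint32_t>(slot);
    }
    return table;
}

// Lookup: one hash, one 64-bit load, two compares.
template <typename Hash>
bool contains(const std::vector<std::uint64_t>& table, const Hash& hash,
              std::uint32_t key) {
    if (key == EMPTY) return false;  // EMPTY is reserved for unused half-slots
    std::uint64_t slot = table[hash(key, static_cast<std::uint32_t>(table.size()))];
    return static_cast<std::uint32_t>(slot) == key ||
           static_cast<std::uint32_t>(slot >> 32) == key;
}

With ~100 probe keys and 50-300 such tables, a full query batch is at most a few tens of thousands of these hash-plus-load probes.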

However, your data may not be so good and may require a bucket size of 3, which destroys the possibility of using a long long for the bucket items. In that case you can try to find some decent hash function instead of the random mess I've written.
