What hashing techniques to use when building a bloom filter in Clojure?

I want to build a bloom filter in Clojure, but I don't know much about the hashing libraries available to JVM-based languages.

What should I use for the fastest (as opposed to most accurate) bloom filter implementation in Clojure?

Take a look at the Bloom filter implementation in Apache Cassandra. It uses the very fast MurmurHash3 algorithm and combines two hashes (or, since the upgrade from MurmurHash2 to MurmurHash3, two portions of the same hash) in different ways to calculate the desired number of hashes.

The combinatorial generation approach is described in this paper, and here's a snippet from the Cassandra source code:

    long[] hash = MurmurHash.hash3_x64_128(b, b.position(), b.remaining(), 0L);
    long hash1 = hash[0];
    long hash2 = hash[1];
    for (int i = 0; i < hashCount; ++i)
    {
        result[i] = Math.abs((hash1 + (long)i * hash2) % max);
    }
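
The snippet above depends on Cassandra's own MurmurHash class, which is not part of the JDK. As a rough, self-contained sketch of the same idea (deriving all indexes from two base hashes as hash1 + i * hash2), the version below swaps in 64-bit FNV-1a and uses its two 32-bit halves as the base hashes. The class name `DoubleHashing`, the method names, and the choice of FNV-1a are assumptions made for illustration; they are not what Cassandra does.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class DoubleHashing {
    // 64-bit FNV-1a. ASSUMPTION: a stand-in for MurmurHash3, chosen only
    // because it needs no external library.
    static long fnv1a64(byte[] data) {
        long h = 0xcbf29ce484222325L;      // FNV offset basis
        for (byte b : data) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;           // FNV prime
        }
        return h;
    }

    // Derives hashCount bit positions in [0, max) from two base hashes.
    static long[] indexes(byte[] key, int hashCount, long max) {
        long h = fnv1a64(key);
        long hash1 = h & 0xffffffffL;      // low 32 bits
        long hash2 = h >>> 32;             // high 32 bits
        long[] result = new long[hashCount];
        for (int i = 0; i < hashCount; i++) {
            // floorMod never returns a negative value for a positive max
            result[i] = Math.floorMod(hash1 + (long) i * hash2, max);
        }
        return result;
    }

    public static void main(String[] args) {
        byte[] key = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.toString(indexes(key, 5, 1024L)));
    }
}
```

Only one pass over the key is needed regardless of how many indexes are requested, which is the main reason this approach is fast.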

See also: Bloomfilter and Cassandra = Why used and why hashed several times?

The fun thing about bloom filters is that, to work effectively, they need multiple hash functions.
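
To make the role of the multiple hash functions concrete, here is a minimal Java sketch of a bloom filter backed by `java.util.BitSet`. Adding an element sets one bit per hash function, and a lookup reports "possibly present" only if every one of those bits is set. The class `TinyBloomFilter` and its seeded hash are hypothetical and written for illustration only; the mixing constants are those of the MurmurHash3 32-bit finalizer, but the hash as a whole is not MurmurHash3.

```java
import java.util.BitSet;

class TinyBloomFilter {
    private final BitSet bits;
    private final int size;       // number of bits in the filter
    private final int hashCount;  // number of hash functions

    TinyBloomFilter(int size, int hashCount) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashCount = hashCount;
    }

    // ASSUMPTION: a simple seeded polynomial hash followed by a bit-mixing
    // step; each seed acts as a different hash function.
    private int index(String s, int seed) {
        int h = seed;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return Math.floorMod(h, size);
    }

    void add(String s) {
        for (int k = 0; k < hashCount; k++) {
            bits.set(index(s, k));
        }
    }

    // false means "definitely absent"; true means "possibly present".
    boolean mightContain(String s) {
        for (int k = 0; k < hashCount; k++) {
            if (!bits.get(index(s, k))) {
                return false;
            }
        }
        return true;
    }
}
```

With a single hash function, any two elements that share one bit are indistinguishable; requiring several bits to match is what keeps the false-positive rate manageable.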

Java Strings already have one hash function built in that you can use: String.hashCode(), which returns a 32-bit integer hash. It's an OK hashcode for most purposes, and it may be sufficient: if you partition it into two separate 16-bit hashcodes, for example, that might be good enough for your bloom filter to work. You will probably get a few collisions, but that's fine; bloom filters are expected to have some collisions.
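
A minimal Java sketch of that partitioning, assuming the two halves are simply the high and low 16 bits of String.hashCode() (the class and method names are hypothetical):

```java
class SplitHashCode {
    // Splits the 32-bit String.hashCode() into two 16-bit values.
    static int[] halves(String s) {
        int h = s.hashCode();
        int high = h >>> 16;     // unsigned shift, so the result is 0..65535
        int low = h & 0xFFFF;    // low 16 bits, also 0..65535
        return new int[] { high, low };
    }

    public static void main(String[] args) {
        int[] parts = halves("Hello");
        // "Hello".hashCode() is 69609650, so this prints "1062 10418"
        System.out.println(parts[0] + " " + parts[1]);
    }
}
```

Note that a 16-bit value can address at most 65,536 distinct bits, so this trick only suits small filters unless the halves are combined further.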

If not, you'll probably want to roll your own, in which case I'd recommend using String.getChars() to access the raw char data and then using it to calculate multiple hashcodes.

Clojure code to get you started (just summing up the character values):

    ;; Copy the string's chars into a primitive array, then sum them.
    (let [s "Hello"
          n (count s)
          cs (char-array n)]
      (.getChars s 0 n cs 0)
      (areduce cs i v 0 (+ v (int (aget cs i)))))
    ;; => 500

Note the use of Clojure's Java interop to call getChars, and the use of areduce to give you very fast iteration over the character array.
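
Summing the character values is order-insensitive, so "ab" and "ba" hash to the same value. As one possible next step, the Java sketch below computes two order-sensitive hashcodes in a single pass over the same char array. The class name, the method name, and the second multiplier (37) are arbitrary choices made for illustration; the first recurrence is the one String.hashCode() is specified to use.

```java
class CharHashes {
    // Computes two polynomial hashes over the string's chars in one pass.
    static int[] twoHashes(String s) {
        int n = s.length();
        char[] cs = new char[n];
        s.getChars(0, n, cs, 0);   // same call as in the Clojure example
        int h1 = 0;
        int h2 = 0;
        for (char c : cs) {
            h1 = 31 * h1 + c;      // same recurrence as String.hashCode()
            h2 = 37 * h2 + c;      // ASSUMPTION: arbitrary second multiplier
        }
        return new int[] { h1, h2 };
    }

    public static void main(String[] args) {
        int[] ab = twoHashes("ab");
        int[] ba = twoHashes("ba");
        // Unlike a plain sum, both hashes distinguish "ab" from "ba".
        System.out.println(ab[0] + " " + ab[1] + " vs " + ba[0] + " " + ba[1]);
    }
}
```

Because this is plain Java, it can be called from Clojure through the same interop mechanism used for getChars.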

You may also be interested in this Java bloom filter implementation I found on GitHub (https://github.com/MagnusS/Java-BloomFilter). The hashcode implementation looks OK at first glance, but it uses a byte array, which I think is a bit less efficient than using chars because of the character encoding overhead.
