
Are there URL specific hashCode methods?

Is there a memory-efficient way to generate an "ID" for a URL?

At the moment I have a cache à la Set<String> for my URLs, so I can easily check whether a URL was already resolved by my crawler or not. This requires a lot of memory, so I replaced it with a Set<Long> and used the hashCode of the URLs. The problem now is that even for 40k URLs there are 10 collisions. An improved method which uses a long instead of the int hashCode reduces this to 6 collisions, but especially short URLs, which often look very similar at the beginning, still cause problems:

5852015146777169869 http://twitpic.com/5xuwuk vs. http://twitpic.com/5xuw7m 5852015146777169869

So I ended up with the following URL-specific double hashing method, which gives no collisions for 2.5 million URLs, which is fine for me:

public static long urlHashing(String str) {
    if (str.length() < 2)
        return str.hashCode();

    long val = longHashCode(str, 31, false);
    if (str.length() > 3)
        // use the end of the string because those short URLs
        // are often identical at the beginning
        return 43 * val + longHashCode(str.substring(str.length() / 2), 37, true);
    return val;
}

public static long longHashCode(String str, int num, boolean up) {
    int len = str.length();
    if (len == 0)
        return 0;

    long h = 0;
    // copying to a temp array is only a tiny bit slower in our case,
    // so this here is ~2ms faster for 40k urls
    if (up)
        // walk the string from the beginning
        for (int i = 0; i < len; i++) {
            h = num * h + str.charAt(i);
        }
    else
        // walk the string from the end
        for (int i = len - 1; i >= 0; i--) {
            h = num * h + str.charAt(i);
        }

    return h;
}
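
For completeness, a minimal sketch of how this plugs into the Set<Long> cache mentioned above (assumed to sit in the same class as urlHashing; the field and method names are only illustrative):

// uses java.util.Set / java.util.HashSet
private final Set<Long> visited = new HashSet<Long>();

public boolean shouldCrawl(String url) {
    // Set.add returns false if this hash value was already present,
    // i.e. the URL (or a hash-colliding one) has been seen before
    return visited.add(urlHashing(url));
}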

BUT now I wonder: are there any theories or (Google ;)) papers about URL-specific hashing algorithms? Or simply: can I further reduce the collisions for URLs, or do you see any problems or improvements for my current solution?

Update:

  • Another approach is to separate the URL into protocol, address and file, like it is done in the new URL(str).hashCode() method (which cannot be used directly, as it is very slow -> it resolves the URL on the fly :/); see the sketch after this list
  • See squid-cache.org or the CacheDigest explanation
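
A rough sketch of that component-based idea, using java.net.URI (which, unlike new URL(...), never performs a DNS lookup) and reusing the longHashCode helper from the question; the mixing constants and the fallback are just placeholders:

import java.net.URI;
import java.net.URISyntaxException;

// Sketch only: hash scheme, host and path/query separately and mix the parts.
public static long componentHash(String url) {
    try {
        URI u = new URI(url);
        String scheme = u.getScheme()   == null ? "" : u.getScheme();
        String host   = u.getHost()     == null ? "" : u.getHost();
        String path   = u.getRawPath()  == null ? "" : u.getRawPath();
        String query  = u.getRawQuery() == null ? "" : u.getRawQuery();
        long h = longHashCode(scheme, 31, true);
        h = 43 * h + longHashCode(host, 37, true);
        h = 53 * h + longHashCode(path + "?" + query, 41, false);
        return h;
    } catch (URISyntaxException e) {
        // fall back to hashing the raw string for malformed URLs
        return longHashCode(url, 31, true);
    }
}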

If you want something that works all the time, not just most of the time, short hashes aren't going to cut it. At any length shorter than about 128 bits, as you've observed, even an ideal hash will have a significant collision rate. What you have is a scaling problem, and all you've done by using hash codes is reduce the constant factor; it's still O(n).
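
For a rough sense of scale, the standard birthday-problem estimate below shows how many collisions even an ideal b-bit hash is expected to produce; adding bits shrinks the probability but never makes it zero, which is the point about guarantees above (this snippet is only an illustration, not part of the answer):

// Birthday-problem estimate: among n items hashed by an ideal b-bit hash,
// the expected number of colliding pairs is roughly n * (n - 1) / 2^(b + 1).
public static double expectedCollisions(long n, int bits) {
    return (double) n * (n - 1) / 2.0 / Math.pow(2, bits);
}

// expectedCollisions(40000, 32) is about 0.19, so the 10 observed collisions
// at 40k URLs suggest the plain int hashCode also distributes URLs poorly.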

It sounds like your strings have a lot of prefixes in common, though - have you considered using a trie to store them?
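
A minimal character-level trie sketch for that suggestion (map-based and deliberately simple; radix/Patricia tries would be more compact):

import java.util.HashMap;
import java.util.Map;

// Minimal trie: add() returns true only for URLs not stored before.
// Shared prefixes (e.g. "http://twitpic.com/") are stored once.
public class UrlTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<Character, Node>();
        boolean terminal; // true if some URL ends at this node
    }

    private final Node root = new Node();

    public boolean add(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
            }
            node = next;
        }
        boolean isNew = !node.terminal;
        node.terminal = true;
        return isNew;
    }

    public boolean contains(String url) {
        Node node = root;
        for (int i = 0; i < url.length(); i++) {
            node = node.children.get(url.charAt(i));
            if (node == null) return false;
        }
        return node.terminal;
    }
}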

You should probably use an MD5 hash. The collision rate should be much smaller.
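
A small sketch of that approach with java.security.MessageDigest; folding the 128-bit digest down to a long (for a drop-in Set<Long> replacement, as below) gives part of that collision resistance away again, so keeping the full 16-byte digest is safer:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Returns the first 8 bytes of the URL's MD5 digest packed into a long.
public static long md5Hash(String url) {
    try {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
        long h = 0;
        for (int i = 0; i < 8; i++) {
            h = (h << 8) | (digest[i] & 0xffL);
        }
        return h;
    } catch (NoSuchAlgorithmException e) {
        // every JVM is required to provide MD5, so this should not happen
        throw new IllegalStateException(e);
    }
}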
