
Why is the HashSet's performance so much faster than the List's?

This problem is from LeetCode ( https://leetcode.com/problems/word-ladder/ ).

Given two words (beginWord and endWord) and a dictionary's word list, find the length of the shortest transformation sequence from beginWord to endWord, such that:

Only one letter can be changed at a time. Each transformed word must exist in the word list. Note that beginWord is not a transformed word.

Note:

  • Return 0 if there is no such transformation sequence.
  • All words have the same length.
  • All words contain only lowercase alphabetic characters.
  • You may assume no duplicates in the word list.
  • You may assume beginWord and endWord are non-empty and are not the same.

This is my code, which takes 800 ms to run:

class Solution {
public int ladderLength(String beginWord, String endWord, List<String> wordList){
    if(!wordList.contains(endWord))
        return 0;
    int ret = 1;
    LinkedList<String> queue = new LinkedList<>();
    Set<String> visited = new HashSet<String>();
    queue.offer(beginWord);
    queue.offer(null);
    while(queue.size() != 1 && !queue.isEmpty()) {
        String temp = queue.poll();
        if(temp == null){
            ret++;
            queue.offer(null);
            continue;                
        }
        if(temp.equals(endWord)) {
            //System.out.println("succ ret = " + ret);
            return ret;
        }
        for(String word:wordList) {           
            if(diffOf(temp,word) == 1){
                //System.out.println("offered " + word);
                //System.out.println("ret =" + ret);
                if(!visited.contains(word)){
                    visited.add(word);
                    queue.offer(word);
                }
            }
        }
    }
    return 0;
}
private int diffOf(String s1, String s2) {
    if(s1.length() != s2.length())
        return Integer.MAX_VALUE;
    int dif = 0;
    for(int i=0;i < s1.length();i++) {
        if(s1.charAt(i) != s2.charAt(i))
            dif++;
    }
    return dif;    
}
}

Here is another solution, which takes 100 ms to run:

class Solution {
public int ladderLength(String beginWord, String endWord, List<String> wordList) {
    Set<String> set = new HashSet<>(wordList);
    if (!set.contains(endWord)) {
        return 0;
    }

    int distance = 1;
    Set<String> current = new HashSet<>();
    current.add(beginWord);

    while (!current.contains(endWord)) {
        Set<String> next = new HashSet<>();

        for (String str : current) {
            for (int i = 0; i < str.length(); i++) {
                char[] chars = str.toCharArray();

                for (char c = 'a'; c <= 'z'; c++) {
                    chars[i] = c;
                    String s = new String(chars);

                    if (s.equals(endWord)) {
                        return distance + 1;
                    }

                    if (set.contains(s)) {
                        next.add(s);
                        set.remove(s);
                    }
                }
            }
        }
        distance++;

        if (next.size() == 0) {
            return 0;
        }
        current = next;
    }

    return 0;
}
}

I thought the second solution would be less efficient, because it tests 26 letters for each word. Why is it so fast?

Short answer: your breadth-first search does orders of magnitude more comparisons per 'word distance unit' (hereafter called an iteration).

  • You compare every candidate to every remaining word: time complexity T(N×n) per iteration.
  • They compare every candidate to artificially constructed 'next' candidates, and because they construct the candidates they don't have to 'calculate' the distance at all. For simplicity, I assume both (constructing or checking) have the same running time. The time complexity is T(26×l×n) per iteration.

(N = word list size, n = number of candidates for this iteration, l = word length)

Of course 26×l×n is much less than N×n because the word length is small but the word list is huge.
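To make the difference concrete, here is a minimal side-by-side sketch of the two neighbour-finding strategies (class and method names are mine, not from either solution):

```java
import java.util.*;

class NeighborDemo {
    // Strategy 1 (the question's code): scan the whole word list and
    // compute the distance to every word — O(N·l) character work per candidate.
    static List<String> byScan(String w, List<String> wordList) {
        List<String> out = new ArrayList<>();
        for (String cand : wordList)
            if (diff(w, cand) == 1) out.add(cand);
        return out;
    }

    // Strategy 2 (the fast code): construct all 26·l one-letter variants
    // and test each against a HashSet — O(26·l) hash lookups per candidate.
    static List<String> byConstruction(String w, Set<String> dict) {
        List<String> out = new ArrayList<>();
        char[] chars = w.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == original) continue;
                chars[i] = c;
                String s = new String(chars);
                if (dict.contains(s)) out.add(s);
            }
            chars[i] = original;   // restore before moving to the next position
        }
        return out;
    }

    static int diff(String s1, String s2) {
        if (s1.length() != s2.length()) return Integer.MAX_VALUE;
        int d = 0;
        for (int i = 0; i < s1.length(); i++)
            if (s1.charAt(i) != s2.charAt(i)) d++;
        return d;
    }
}
```

Both return the same neighbours; only the amount of work per candidate differs.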

I tried your routine on ("and","has",[List of 2M English words]) and after 30 seconds I killed it because I thought it had crashed. It hadn't crashed; it was just slow. I switched to another word list of 50K words, and yours now takes 8 seconds, versus 0.04 s for their implementation.

For my word list of N=51306 there are 2167 3-letter words. This means that, on average, every word has 3×cbrt(2167) possible candidates, which is n≈38.82.

  • Their expected performance: T(26×l×n) ≈ T(3027) work per iteration,
  • Your expected performance: T(N×n) ≈ T(1991784) work per iteration.

(assuming the word list does not get shorter; but with this many words the difference is negligible)
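As a quick sanity check on those figures, the estimates can be recomputed directly (a throwaway sketch; small differences from the numbers above come from rounding n):

```java
public class WorkEstimate {
    public static void main(String[] args) {
        int N = 51306;                    // word-list size
        int l = 3;                        // word length
        double n = 3 * Math.cbrt(2167);   // ≈ 38.82 candidates per iteration

        System.out.printf("constructed neighbours: %.0f ops/iteration%n", 26 * l * n);
        System.out.printf("full list scan:         %.0f ops/iteration%n", N * n);
        System.out.printf("ratio:                  %.0f x%n", (double) N / (26 * l));
    }
}
```

The ratio N/(26×l) is what makes the second solution roughly two to three orders of magnitude faster per iteration.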


Incidentally, your queue-based circular buffer implementation is possibly faster than their two-alternating-Sets implementation, so you could make a hybrid that's even faster.
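One possible shape of that hybrid — a hedged sketch, not a definitive implementation; a level-by-level loop stands in for the question's null-sentinel trick:

```java
import java.util.*;

// Hybrid sketch (names are mine): the single queue from the question's BFS,
// combined with the constructed-neighbour HashSet lookup from the fast solution.
class HybridSolution {
    public int ladderLength(String beginWord, String endWord, List<String> wordList) {
        Set<String> dict = new HashSet<>(wordList);   // O(1) membership tests
        if (!dict.contains(endWord)) return 0;

        Deque<String> queue = new ArrayDeque<>();
        queue.offer(beginWord);
        int distance = 1;

        while (!queue.isEmpty()) {
            // Drain exactly one BFS level per pass instead of using a null sentinel.
            for (int size = queue.size(); size > 0; size--) {
                char[] chars = queue.poll().toCharArray();
                for (int i = 0; i < chars.length; i++) {
                    char original = chars[i];
                    for (char c = 'a'; c <= 'z'; c++) {
                        if (c == original) continue;
                        chars[i] = c;
                        String next = new String(chars);
                        if (next.equals(endWord)) return distance + 1;
                        if (dict.remove(next))    // removal doubles as the visited set
                            queue.offer(next);
                    }
                    chars[i] = original;
                }
            }
            distance++;
        }
        return 0;
    }
}
```

Removing a word from the set as soon as it is enqueued serves as the visited check, so no separate `visited` set is needed.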
