
Why is the HashSet's performance way faster than the List's?

This problem is from LeetCode (https://leetcode.com/problems/word-ladder/):

Given two words (beginWord and endWord) and a dictionary's word list, find the length of the shortest transformation sequence from beginWord to endWord, such that:

  • Only one letter can be changed at a time.
  • Each transformed word must exist in the word list. Note that beginWord is not a transformed word.

Note:

  • Return 0 if there is no such transformation sequence.
  • All words have the same length.
  • All words contain only lowercase alphabetic characters.
  • You may assume no duplicates in the word list.
  • You may assume beginWord and endWord are non-empty and are not the same.
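
For illustration (the standard example from the problem page): with beginWord = "hit", endWord = "cog" and wordList = ["hot","dot","dog","lot","log","cog"], one shortest sequence is "hit" → "hot" → "dot" → "dog" → "cog", so the answer is 5.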

This is my code, which takes 800 ms to run:

class Solution {
    public int ladderLength(String beginWord, String endWord, List<String> wordList) {
        if (!wordList.contains(endWord))
            return 0;
        int ret = 1;
        LinkedList<String> queue = new LinkedList<>();
        Set<String> visited = new HashSet<String>();
        queue.offer(beginWord);
        queue.offer(null);                      // null marks the end of a BFS level
        while (queue.size() != 1 && !queue.isEmpty()) {
            String temp = queue.poll();
            if (temp == null) {                 // finished one level: distance grows by 1
                ret++;
                queue.offer(null);
                continue;
            }
            if (temp.equals(endWord)) {
                //System.out.println("succ ret = " + ret);
                return ret;
            }
            // scan the entire word list for neighbors of temp
            for (String word : wordList) {
                if (diffOf(temp, word) == 1) {
                    //System.out.println("offered " + word);
                    //System.out.println("ret =" + ret);
                    if (!visited.contains(word)) {
                        visited.add(word);
                        queue.offer(word);
                    }
                }
            }
        }
        return 0;
    }

    // number of positions at which s1 and s2 differ
    private int diffOf(String s1, String s2) {
        if (s1.length() != s2.length())
            return Integer.MAX_VALUE;
        int dif = 0;
        for (int i = 0; i < s1.length(); i++) {
            if (s1.charAt(i) != s2.charAt(i))
                dif++;
        }
        return dif;
    }
}

Here is another solution, which takes 100 ms to run:

class Solution {
    public int ladderLength(String beginWord, String endWord, List<String> wordList) {
        Set<String> set = new HashSet<>(wordList);
        if (!set.contains(endWord)) {
            return 0;
        }

        int distance = 1;
        Set<String> current = new HashSet<>();
        current.add(beginWord);

        while (!current.contains(endWord)) {
            Set<String> next = new HashSet<>();

            for (String str : current) {
                for (int i = 0; i < str.length(); i++) {
                    char[] chars = str.toCharArray();

                    // construct all 26 one-letter variants at position i
                    for (char c = 'a'; c <= 'z'; c++) {
                        chars[i] = c;
                        String s = new String(chars);

                        if (s.equals(endWord)) {
                            return distance + 1;
                        }

                        // O(1) expected lookup; removing doubles as the visited check
                        if (set.contains(s)) {
                            next.add(s);
                            set.remove(s);
                        }
                    }
                }
            }
            distance++;

            if (next.size() == 0) {
                return 0;
            }
            current = next;
        }

        return 0;
    }
}

I would have thought the second solution is less efficient, because it tests all 26 letters at every position of each word. Why is it so fast?

Short answer: your breadth-first search does orders of magnitude more comparisons per 'word distance unit' (hereafter called an iteration).

  • You compare every candidate to every remaining word: T(N×n) work per iteration.
  • They compare every candidate only against artificially constructed 'next' candidates (see the sketch below), and because they construct the candidates they never need to compute a distance. Assuming, for simplicity, that constructing a candidate and checking one cost about the same, that is T(26×l×n) work per iteration.

(N = word list size, n = number of candidates in this iteration, l = word length)

Of course 26×l×n is much smaller than N×n, because the word length is small while the word list is huge.
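
To make the constructed-candidates idea concrete, here is a minimal standalone sketch (the NeighborSketch and neighbors names are mine, not from either solution) of the 26×l construction followed by O(1) expected lookups:

import java.util.*;

class NeighborSketch {
    // All dictionary words reachable from 'word' by changing exactly one letter:
    // 26 * l constructions, each followed by an O(1) expected HashSet lookup,
    // instead of a distance computation against every remaining word.
    static List<String> neighbors(String word, Set<String> dict) {
        List<String> result = new ArrayList<>();
        char[] chars = word.toCharArray();
        for (int i = 0; i < chars.length; i++) {
            char original = chars[i];
            for (char c = 'a'; c <= 'z'; c++) {
                if (c == original) continue;   // skip the word itself
                chars[i] = c;
                String candidate = new String(chars);
                if (dict.contains(candidate)) result.add(candidate);
            }
            chars[i] = original;               // restore before the next position
        }
        return result;
    }
}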

I tried your routine on ("and", "has", [list of 2M English words]) and killed it after 30 seconds because I thought it had crashed. It hadn't crashed, it was just slow. I then switched to a word list of 50K, and yours takes 8 seconds, versus 0.04 s for their implementation.

For my word list of N = 51306 there are 2167 3-letter words. If those words are spread evenly through letter-space, each of the 3 positions can take about cbrt(2167) ≈ 12.94 useful values, so every word has on average about 3×cbrt(2167) possible candidates, which is n ≈ 38.82.

  • Their expected performance: T(26×l×n) ≈ T(3027) work per iteration,
  • Your expected performance: T(N×n) ≈ T(1991784) work per iteration.

(this assumes the word list does not shrink as the search proceeds; with this many words the difference is negligible)
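
In other words, the analysis predicts roughly 1991784 / 3027 ≈ 658× less work per iteration for their approach, the same order of magnitude as the measured 8 s vs. 0.04 s (200×).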


Incidentally, your single-queue implementation (with a null level marker) is possibly faster than their two-alternating-Sets implementation, so you could make a hybrid that's even faster; a rough sketch follows.
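
Here is one possible hybrid (my sketch, unbenchmarked): a single queue drained one BFS level per pass instead of a null sentinel, combined with constructed candidates, where dict.remove() doubles as the visited check:

import java.util.*;

class HybridSolution {
    public int ladderLength(String beginWord, String endWord, List<String> wordList) {
        Set<String> dict = new HashSet<>(wordList);
        if (!dict.contains(endWord)) return 0;
        dict.remove(beginWord);                     // never revisit the start word

        Queue<String> queue = new ArrayDeque<>();
        queue.offer(beginWord);
        int distance = 1;

        while (!queue.isEmpty()) {
            // drain exactly one BFS level (replaces the null sentinel)
            for (int size = queue.size(); size > 0; size--) {
                String word = queue.poll();
                if (word.equals(endWord)) return distance;

                char[] chars = word.toCharArray();
                for (int i = 0; i < chars.length; i++) {
                    char original = chars[i];
                    for (char c = 'a'; c <= 'z'; c++) {
                        chars[i] = c;
                        String candidate = new String(chars);
                        // remove() returns true only for unvisited dictionary words
                        if (dict.remove(candidate)) queue.offer(candidate);
                    }
                    chars[i] = original;            // restore before the next position
                }
            }
            distance++;
        }
        return 0;
    }
}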
