Optimal solution of removing duplicates from an unsorted string

Question

I am working on an interview problem about removing duplicate characters from a string.

The naive solution, which uses two nested for-loops to compare the character at the current index with the characters at every other index, actually seems more difficult to implement.
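For reference, the two-loop idea could be sketched roughly as follows. This is only an illustration under assumptions: the class and method names (NaiveDedup, removeDuplicatesNaive) are hypothetical, and the sketch keeps the first occurrence of each character.

```java
class NaiveDedup {
    // Hypothetical helper: keeps the first occurrence of each char.
    // Time is O(n^2) because every char is compared with all earlier chars.
    static String removeDuplicatesNaive(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char current = s.charAt(i);
            boolean seenEarlier = false;
            // Inner loop: look for the same char at any earlier index.
            for (int j = 0; j < i; j++) {
                if (s.charAt(j) == current) {
                    seenEarlier = true;
                    break;
                }
            }
            if (!seenEarlier) {
                out.append(current);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeDuplicatesNaive("banana")); // prints "ban"
    }
}
```

No extra set is needed here, but the quadratic running time is what the set-based versions avoid.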

I tried this problem a couple of times. My first attempt was O(n) but only worked on sorted strings, e.g. aabbcceedfg.
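The code of that first attempt is not shown. As an assumption, a single pass that compares each character with its predecessor would match the description; the sketch below (hypothetical names) illustrates why such an approach only works when equal characters sit next to each other.

```java
class AdjacentDedup {
    // Hypothetical sketch: drops a char only if it equals the char right before it.
    // O(n) time, but correct only when duplicates are adjacent (e.g. sorted input).
    static String removeAdjacentDuplicates(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            if (i == 0 || s.charAt(i) != s.charAt(i - 1)) {
                out.append(s.charAt(i));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(removeAdjacentDuplicates("aabbcceedfg")); // prints "abcedfg"
        System.out.println(removeAdjacentDuplicates("abab"));        // prints "abab" (duplicates survive)
    }
}
```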

I then realized I could use a HashSet. This solution's time complexity is O(n) as well, but it relies on two Java library classes, StringBuffer and HashSet, so its space complexity is not that great.

public static String duplicate(String s) {
    HashSet<Character> dup = new HashSet<Character>();
    StringBuffer string = new StringBuffer();

    for(int i = 0; i < s.length() - 1; i++) {
        if(!dup.contains(s.charAt(i))){
            dup.add(s.charAt(i));
            string.append(s.charAt(i));
        }
    }
    return string.toString();
}

I was wondering: is this solution optimal and valid for a technical interview? If it is not optimal, what is a better method?

I did Google a lot for the optimal solution to this problem. However, most solutions relied on too many Java-specific libraries, which are not valid in an interview context.

Answer 1

You can't improve on the complexity but you can optimize the code while keeping the same complexity.

Use a BitSet instead of a HashSet (or even just a boolean[]). There are only 65536 different char values, so a BitSet covering all of them fits in 8 KB. Each bit records whether you have seen that character before.
Set the StringBuffer to a specified initial capacity; this is a very minor improvement.
Bugfix: your for-loop ends at i < s.length() - 1 but it should end at i < s.length(), otherwise it ignores the last character of the string.

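import java.util.BitSet;

public static String duplicate(String s) {
    // One bit per possible char value; a set bit means "already seen".
    BitSet bits = new BitSet();
    // Pre-size the buffer: the result is never longer than the input.
    StringBuffer string = new StringBuffer(s.length());

    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        if (!bits.get(c)) {
            bits.set(c);
            string.append(c);
        }
    }
    return string.toString();
}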
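The boolean[] alternative mentioned above could look roughly like this. It is a sketch, not the answer's own code: the class name is hypothetical, and StringBuilder is swapped in for StringBuffer because no synchronization is needed here.

```java
class BooleanArrayDedup {
    // Sketch of the boolean[] variant: one flag per possible char value.
    static String duplicate(String s) {
        // Character.MAX_VALUE + 1 == 65536 flags, one for every char value.
        boolean[] seen = new boolean[Character.MAX_VALUE + 1];
        StringBuilder out = new StringBuilder(s.length());

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (!seen[c]) {
                seen[c] = true;
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(duplicate("hello world")); // prints "helo wrd"
    }
}
```

An array of flags typically takes more memory than a BitSet of the same range, in exchange for plain array indexing.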

Answer 2

When using sets/maps, don't forget that almost all mutating methods return values. For example, Set.add returns whether the element was actually added, and Set.remove returns whether it was actually removed. Map.put and Map.remove return the previous value. Using this, you don't need to query the set twice; just change the condition to if(dup.add(s.charAt(i))) ...

A second improvement, from the performance point of view, could be to dump the String into a char[] array and process it manually without any StringBuffer/StringBuilder:
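A small illustration of those return values, using only standard java.util calls; the class and method names are hypothetical and exist just for the demo.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class ReturnValuesDemo {
    // Collects the return values of a few Set/Map calls into one string.
    static String demo() {
        Set<Character> seen = new HashSet<>();
        boolean firstAdd = seen.add('a');    // true: 'a' was not in the set yet
        boolean secondAdd = seen.add('a');   // false: 'a' was already present
        boolean removed = seen.remove('a');  // true: 'a' was present and removed

        Map<Character, Integer> counts = new HashMap<>();
        Integer before = counts.put('a', 1);         // null: no previous mapping
        Integer replaced = counts.put('a', 2);       // 1: the previous value
        Integer removedValue = counts.remove('a');   // 2: the value that was removed

        return firstAdd + " " + secondAdd + " " + removed + " "
                + before + " " + replaced + " " + removedValue;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints "true false true null 1 2"
    }
}
```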

public static String duplicate(String s) {
    HashSet<Character> dup = new HashSet<Character>();
    char[] chars = s.toCharArray();

    int i=0;
    for(char ch : chars) {
        if(dup.add(ch))
            chars[i++] = ch;
    }
    return new String(chars, 0, i);
}

Note that we write the result into the same array that we are iterating over. This works because the write position never exceeds the read position.

Of course using BitSet as suggested by @ErwinBolwidt would be even more performant in this case:

public static String duplicate(String s) {
    BitSet dup = new BitSet();
    char[] chars = s.toCharArray();

    int i=0;
    for(char ch : chars) {
        if(!dup.get(ch)) {
            dup.set(ch, true);
            chars[i++] = ch;
        }
    }
    return new String(chars, 0, i);
}

Finally, just for completeness, there's a Java 8 Stream API solution, which is slower but probably more expressive:

public static String duplicateStream(String s) {
    return s.codePoints().distinct()
            .collect(StringBuilder::new, StringBuilder::appendCodePoint,
                    StringBuilder::append).toString();
}

Note that processing code points is better than processing chars, because the method then works correctly even for characters encoded as Unicode surrogate pairs.
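To make the surrogate-pair point concrete, here is a hedged comparison sketch. The class and method names are hypothetical, and the input is an assumed example: U+1F600 followed by U+1F601, two distinct characters that share the same high surrogate.

```java
import java.util.HashSet;
import java.util.Set;

class CodePointDedup {
    // Char-based version: treats each UTF-16 code unit as a separate character.
    static String byChar(String s) {
        Set<Character> seen = new HashSet<>();
        StringBuilder out = new StringBuilder();
        for (char ch : s.toCharArray()) {
            if (seen.add(ch)) {
                out.append(ch);
            }
        }
        return out.toString();
    }

    // Code-point-based version: a surrogate pair is handled as one unit.
    static String byCodePoint(String s) {
        Set<Integer> seen = new HashSet<>();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (seen.add(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String twoEmoji = "\uD83D\uDE00\uD83D\uDE01"; // U+1F600 U+1F601
        // byChar drops the second high surrogate and leaves a broken string.
        System.out.println(byChar(twoEmoji).equals(twoEmoji));      // prints "false"
        System.out.println(byCodePoint(twoEmoji).equals(twoEmoji)); // prints "true"
    }
}
```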

Answer 3

If it's a really long string, your algorithm will spend most of its time just throwing away characters.

Another approach that could be faster with long strings (like book-long) is to simply go through the alphabet, look for the first occurrence of each character, and store the index at which it is found. Once all characters have been looked up, build the new string from the characters, ordered by where they were found.

package se.wederbrand.stackoverflow.alphabet;

import java.util.Map;
import java.util.TreeMap;

public class Finder {
    public static void main(String[] args) {
        String target = "some really long string"; // like millions of characters
        // TreeMap keeps entries sorted by index, so the result preserves
        // first-occurrence order (a HashMap gives no ordering guarantee).
        TreeMap<Integer, Character> found = new TreeMap<Integer, Character>();

        // Only the letters 'a' to 'z' are considered; anything else is dropped.
        for (char c = 'a'; c <= 'z'; c++) {
            int foundAt = target.indexOf(c);
            if (foundAt != -1) {
                found.put(foundAt, c);
            }
        }

        StringBuffer result = new StringBuffer();
        for (Map.Entry<Integer, Character> entry : found.entrySet()) {
            result.append(entry.getValue());
        }

        System.out.println(result.toString());
    }
}

Note that this will be slow on strings where at least one character of the alphabet is missing, because indexOf then has to scan the entire string for that character.

3 answers

solution1
3 ACCEPTED 2015-09-18 04:35:53

solution2
0 2015-09-18 04:56:00

solution3
-1 2015-09-18 04:45:45
