简体   繁体   中英

What it is the best performance way to grab duplicates from a large Set<String>?

I have a large Set<String> that contains many words, say:

"aaa, cCc, dDD, AAA, bbB, BBB, AaA, CCc, ..."

I want to group all duplicate words from the Set ignoring the case sensitivity of the words then save them in a Vector<Vector<String>> or whatever, so each Vector<String> item will contain the group of similar words, like this:

Vector<String> : aaa, AAA, AaA, ...

Vector<String> : cCc, CCc, ...

Vector<String> : bbB, BBB, ...

I care about the performance as this Set contain many words.

If you truly care about performance you would not use Vector . As for the sorting problem one solution would be to use the TreeMap or TreeSet object and create a Comparator that does the equality (sorting) you want.

The instantiation could be:

new TreeMap<String,LinkedList<String>>(new Comparator<String>() {

   // comparator here

});

Usage:

LinkedList<String> entry = map.get(nextKey);
if (entry == null) {
  entry = new LinkedList<String>()
  map.put(nextKey, entry);
}
entry.add(nextKey);

I would create a HashMap<String, Vector<String>> hashMap . Next, for each 'string' in your set

if (!hashMap.containsKey(string.toLowerCase()){
     Vector v = new Vector();
     v.add(string);
      hashMap.put(string.toLowerCase(), v);
} else { 
     hashMap.get(string.toLowerCase()).add(string);
}

At the end, create a Vector of vectors if needed, or work with the hashmap.valueSet()

If you can choose Set implementation you can use TreeSet with Comparator that compares strings ignoring case. Then you will be able to iterate over sorted list and easily group duplicates.

This iterates over the input set once and I doubt you can get much faster than that. Swapping the ArrayList s for LinkedLists may trade locality for less copying, which may be an performance gain, but I doubt it. Here's the code:

Set<String> input = new HashSet<String>(Arrays.asList(
    "aaa", "cCc", "dDD", "AAA", "bbB", "BBB", "AaA", "CCc"));

Map<String, List<String>> tmp = new HashMap<String, List<String>>();

for (String s : input) {
    String low = s.toLowerCase();
    List<String> l = tmp.get(low);

    if (l == null) {
        l = new ArrayList<String>();
        tmp.put(low, l);
    }

    l.add(s);
}

final List<List<String>> result = new ArrayList<List<String>>(tmp.values());

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM