fast way to check if two sets contain at least one same element

Question

I have two TreeMaps and I want to check if they contain at least one identical key (the keys are Strings). So I use two loops for comparison:

boolean found = false;
for(String key1 : map1.keySet()){
    for(String key2 : map2.keySet()){
        if(key1.equals(key2)){
            found = true;
            break;
        }
    }
    if(found){
        break;
    }
}
if(found){
    someFunction(map1, map2);
}

As I have 500,000 TreeMaps (with about 1000 keys each) and I want to check each map against each other map, it takes a long time. Does anyone know a faster solution?

*Edit: I want to call the someFunction() method every time I find two maps with at least one common key. I expect found == false in more than 90% of all cases.

Answer 1

One approach is to build a multimap of key -> maps, i.e. iterate over all 500k maps and register each map under every key it contains.

Then iterate over the keys again and if there are two or more maps for a key, those maps share it.

With that approach complexity should drop from O(n² * m) to O(n * m) ( n being the number of maps and m being the number of keys).

Rough outline:

Multimap<Key, Map<Key, Value>> mapsContainingKey = ... ;//could be a Guava Multimap
//O(n * m) complexity
for(Map<Key, Value> m : largeSetOfTreeMaps ) {
  for(Key k : m.keySet() ) {
    mapsContainingKey.put( k, m );
  }
}

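//one pass over the distinct keys (at most n * m of them)
//asMap() groups the maps per key; entries() would yield one entry per (key, map) pair instead
for( Entry<Key, Collection<Map<Key, Value>>> entry : mapsContainingKey.asMap().entrySet() ) {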
  Key key = entry.getKey();
  Collection<Map<Key, Value>> mapsWithSameKey = entry.getValue();
  if( mapsWithSameKey.size() > 1 ) {
    //all maps in that collection share this key
  }
}
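
The outline above stops at detecting shared keys. As a rough sketch of how the same index could be turned into the list of map pairs for which someFunction() should be called, here is a version that uses only java.util instead of Guava. The class name SharedKeyIndex, the method pairsSharingAKey and the representation of a pair as a two-element list of positions are all made up for illustration. Note that a key shared by k maps produces k*(k-1)/2 pairs, so this assumes sharing is rare, as the question suggests.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical helper, not part of the original answer.
final class SharedKeyIndex {

    // Returns every pair [i, j] (i < j) of list positions whose maps share at least one key.
    static Set<List<Integer>> pairsSharingAKey(List<? extends Map<String, ?>> maps) {
        // key -> positions of all maps that contain this key
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < maps.size(); i++) {
            for (String key : maps.get(i).keySet()) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
            }
        }
        // A set, so a pair sharing several keys is reported only once.
        Set<List<Integer>> pairs = new HashSet<>();
        for (List<Integer> owners : index.values()) {
            // Positions were appended in increasing order, so owners.get(a) < owners.get(b).
            for (int a = 0; a < owners.size(); a++) {
                for (int b = a + 1; b < owners.size(); b++) {
                    pairs.add(Arrays.asList(owners.get(a), owners.get(b)));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> first = new TreeMap<>();
        first.put("x", 1);
        first.put("y", 2);
        TreeMap<String, Integer> second = new TreeMap<>();
        second.put("z", 3);
        TreeMap<String, Integer> third = new TreeMap<>();
        third.put("y", 4);
        third.put("w", 5);
        // Only the maps at positions 0 and 2 share a key ("y").
        System.out.println(pairsSharingAKey(Arrays.asList(first, second, third)));
    }
}
```

Storing positions rather than the maps themselves avoids hashing whole maps, which would otherwise happen if the maps were put into a hash-based collection.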

Update: I ran a quick benchmark and though it is not optimized there's a clear trend:

The "naive" approach is looping over all maps and checking against all following maps so that each pair is only checked once. Additionally I applied what Holger suggested for comparing two maps.
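
For reference, a pairwise loop of that shape could look roughly like the sketch below. This is not the benchmark code: it uses Collections.disjoint from the standard library as a stand-in for the pair test, and the class and method names are invented for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of the "naive" pairwise strategy.
final class NaivePairCheck {

    // Counts the pairs (i, j) with i < j whose key sets overlap; each pair is visited exactly once.
    static int countOverlappingPairs(List<? extends Map<String, ?>> maps) {
        int count = 0;
        for (int i = 0; i < maps.size(); i++) {
            for (int j = i + 1; j < maps.size(); j++) {
                // disjoint(...) returns true when the two collections have no element in common
                if (!Collections.disjoint(maps.get(i).keySet(), maps.get(j).keySet())) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> first = new TreeMap<>();
        first.put("x", 1);
        first.put("y", 2);
        TreeMap<String, Integer> second = new TreeMap<>();
        second.put("z", 3);
        TreeMap<String, Integer> third = new TreeMap<>();
        third.put("y", 4);
        System.out.println(countOverlappingPairs(Arrays.asList(first, second, third))); // prints 1
    }
}
```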

The "map" approach is what I posted here.

Results on my machine for 1000 maps with each having 100 random String keys of length 10:

naive: 11656 ms
map:     235 ms

Update 2: Some more results with different sizes:

1000 maps with 100 keys of varying length (the longer the keys, the fewer collisions)

key length   1        2         3         4         5        10        20
naive      417 ms  3221 ms  10937 ms  11273 ms  11357 ms  11383 ms  11706 ms
map         16 ms    43 ms     86 ms    224 ms    245 ms    210 ms    154 ms

1000 maps with varying number of keys each and key length 10 (the more keys, the more collisions)

key count    50       100       500
naive      4865 ms  11368 ms  81280 ms 
map          64 ms    206 ms    913 ms

Varying number of maps with 1000 keys each and key length 10 (the more maps, the more collisions)

map count    500     1000      2000
naive      6323 ms  12766 ms  47798 ms 
map         139 ms    206 ms    333 ms

As you can see, the number of maps has the most influence on this followed by the number of keys.

Answer 2

You didn't say anything about the ordering, but I assume that all TreeMaps use the same ordering. In that case you can narrow the outer iteration to the key range of the second map. Your inner loop is unnecessary, as you can simply ask the map whether it contains the key.

for(String s: map1.navigableKeySet().subSet(map2.firstKey(), true, map2.lastKey(), true)) {
    if(map2.containsKey(s)) {
        someFunction(map1, map2);
        break;
    }
}

Explanation:

Suppose you have the following map keys:

map2:    D, E, F, G, H
         |           |
       first        last
map1: A,    E,    G,   I
            |<--->|
          subSet("D", true, "H", true)

Here, map2's first element is "D" and its last element is "H". When passing these elements as inclusive bounds to map1's navigableKeySet().subSet(…) method, we get the closest inner set ["E", "G"] as the search range, so "A" and "I" are excluded before the linear search even starts (keep in mind that these are only example placeholders; each might stand for a large number of keys).
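
The example can be reproduced with a few lines. This is only a sketch: the class name SubSetDemo and the helper searchRange are made up, and it assumes map2 is not empty, since firstKey() and lastKey() throw NoSuchElementException on an empty map.

```java
import java.util.NavigableSet;
import java.util.TreeMap;

// Hypothetical demo of the range narrowing described above.
final class SubSetDemo {

    // Keys of map1 that lie within [map2.firstKey(), map2.lastKey()]. Assumes map2 is not empty.
    static NavigableSet<String> searchRange(TreeMap<String, ?> map1, TreeMap<String, ?> map2) {
        return map1.navigableKeySet().subSet(map2.firstKey(), true, map2.lastKey(), true);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> map1 = new TreeMap<>();
        for (String key : new String[] {"A", "E", "G", "I"}) {
            map1.put(key, 0);
        }
        TreeMap<String, Integer> map2 = new TreeMap<>();
        for (String key : new String[] {"D", "E", "F", "G", "H"}) {
            map2.put(key, 0);
        }
        System.out.println(searchRange(map1, map2)); // prints [E, G]
    }
}
```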

Taking this idea further, you can skip arbitrary ranges in both maps when comparing:

public static boolean haveCommonKeys(TreeMap<String,?> map1, TreeMap<String,?> map2) {
    if(map1.isEmpty()) return false;
    for(String s=map1.firstKey(); s!=null; ) {
        String s2=map2.ceilingKey(s);
        if(s2==null) break;
        if(s2.equals(s)) return true;
        s=map1.ceilingKey(s2);
        if(s2.equals(s)) return true;
    }
    return false;
}

In this solution, we start with the first (smallest) key of a map and ask each map for a key that is the same or bigger than the value we found in the other map. This way we will skip all consecutive keys of a map for which the other map contains no in-between key.
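
To see the skipping in action, here is a small self-contained sketch that repeats the method above and runs it on the keys from the earlier diagram and on a case without common keys. The class name LeapfrogDemo and the helper mapOf are made up for the demo; natural String ordering is assumed for both maps.

```java
import java.util.TreeMap;

// Hypothetical demo class; haveCommonKeys is copied from the answer above.
final class LeapfrogDemo {

    static boolean haveCommonKeys(TreeMap<String, ?> map1, TreeMap<String, ?> map2) {
        if (map1.isEmpty()) return false;
        for (String s = map1.firstKey(); s != null; ) {
            String s2 = map2.ceilingKey(s);  // smallest key in map2 that is >= s
            if (s2 == null) break;           // map2 has no key at or beyond s
            if (s2.equals(s)) return true;
            s = map1.ceilingKey(s2);         // jump map1 forward to its first key >= s2
            if (s2.equals(s)) return true;
        }
        return false;
    }

    // Builds a TreeMap with the given keys and dummy values.
    static TreeMap<String, Integer> mapOf(String... keys) {
        TreeMap<String, Integer> map = new TreeMap<>();
        for (String key : keys) {
            map.put(key, 0);
        }
        return map;
    }

    public static void main(String[] args) {
        // A -> D (map2) -> E (map1) -> E (map2): match found after three lookups.
        System.out.println(haveCommonKeys(mapOf("A", "E", "G", "I"), mapOf("D", "E", "F", "G", "H"))); // true
        // Keys interleave but never coincide, so the loop runs off the end of map2.
        System.out.println(haveCommonKeys(mapOf("A", "C", "E"), mapOf("B", "D"))); // false
    }
}
```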

Answer 3

Create your own map that stores, for every key, the set of objects that contain it. Calling the getter for a key returns that set, and calling size() on the set tells you whether more than one object is mapped to the key.

You shouldn't put all the data into a single map, though, because that will slow things down considerably. It is better to partition the keys if you can, e.g. all keys made of digits in one map, all keys made of letters in a second map and the rest in a third. Then you can inspect a key, pick the map it belongs to and work with that one. Something like this:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MyMap{

 // key -> all objects registered under that key
 private Map<String, Set<Object>> stuff;

 public MyMap(){
  stuff = new HashMap<String, Set<Object>>(); // Or any other map instead of HashMap
 }

 public void put(final String pKey, final Object pObject){
  Set<Object> objects = stuff.get(pKey);
  if(objects==null){
   // first object for this key: create its set and register it
   objects = new HashSet<Object>();
   stuff.put(pKey, objects);
  }
  objects.add(pObject);
 }

 // returns null if nothing was registered under pKey
 public Set<Object> get(String pKey){
  return stuff.get(pKey);
 }

 public void remove(String pKey){
  stuff.remove(pKey);
 }

}

But be careful, this can really hurt your performance if you have that many maps; you have to split the keys up to make it faster :) You can also use any other map/set. I used HashSet because I think you don't want to add the same object twice to the same key if you want to do checks like the ones you described.

Hope I could help :)
