Fast way to check whether two collections contain at least one common element

Question

I have two TreeMaps and I want to check whether they contain at least one identical key (the keys are Strings). So I use two loops for the comparison:

boolean found = false;
for(String key1 : map1.keySet()){
    for(String key2 : map2.keySet()){
        if(key1.equals(key2)){
            found = true;
            break;
        }
    }
    if(found){
        break;
    }
}
if(found){
    someFunction(map1, map2);
}

Since I have 500,000 TreeMaps (with about 1000 keys each) and I want to check each map against every other map, this takes a long time. Does anyone know a faster solution?

Edit: I want to call the someFunction() method every time I find two maps with at least one common key. I think found == false in more than 90% of all cases.

Answer 1

One way you could try is to build a multimap of key -> maps, i.e. iterate over all 500k maps and register each map under every key it contains.

Then iterate over the keys again; if two or more maps are registered for a key, those maps share it.

With that approach, complexity should drop from O(n² * m) to O(n * m) (n being the number of maps and m being the number of keys per map).

Rough outline:

Multimap<Key, Map<Key, Value>> mapsContainingKey = ... ; // could be a Guava Multimap
// O(n * m) complexity
for (Map<Key, Value> m : largeSetOfTreeMaps) {
  for (Key k : m.keySet()) {
    mapsContainingKey.put(k, m);
  }
}

// one pass over the distinct keys; asMap() groups all maps stored under a key
for (Entry<Key, Collection<Map<Key, Value>>> entry : mapsContainingKey.asMap().entrySet()) {
  Key key = entry.getKey();
  Collection<Map<Key, Value>> mapsWithSameKey = entry.getValue();
  if (mapsWithSameKey.size() > 1) {
    // all maps in that collection share this key
  }
}
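The outline above can also be sketched with plain JDK collections. This is only a minimal sketch, not the answerer's benchmark code: the class name `SharedKeyIndex`, the method `findOverlappingPairs` and the choice to identify maps by their list index are assumptions made for illustration. It collects every pair of maps that shares at least one key, which is where `someFunction(map1, map2)` from the question would be called.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

public class SharedKeyIndex {

    /** Returns index pairs (i, j) with i < j of maps that share at least one key. */
    static Set<List<Integer>> findOverlappingPairs(List<? extends Map<String, ?>> maps) {
        // key -> indices of all maps that contain this key
        Map<String, List<Integer>> index = new HashMap<>();
        for (int i = 0; i < maps.size(); i++) {
            for (String key : maps.get(i).keySet()) {
                index.computeIfAbsent(key, k -> new ArrayList<>()).add(i);
            }
        }

        // Every bucket with two or more indices yields overlapping pairs.
        // A set removes duplicates when two maps share several keys.
        Set<List<Integer>> pairs = new LinkedHashSet<>();
        for (List<Integer> bucket : index.values()) {
            for (int a = 0; a < bucket.size(); a++) {
                for (int b = a + 1; b < bucket.size(); b++) {
                    pairs.add(List.of(bucket.get(a), bucket.get(b)));
                }
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Integer>> maps = new ArrayList<>();
        maps.add(new TreeMap<>(Map.of("a", 1, "b", 2)));
        maps.add(new TreeMap<>(Map.of("c", 3)));
        maps.add(new TreeMap<>(Map.of("b", 4, "d", 5)));
        // maps 0 and 2 share the key "b"
        System.out.println(findOverlappingPairs(maps));
    }
}
```

Building the index is a single pass over all keys. Enumerating the pairs costs extra work proportional to how many maps actually collide on a key, which should stay small if, as the question says, most pairs have nothing in common.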

Update: I ran a quick benchmark, and though it is not optimized there's a clear trend:

The "naive" approach loops over all maps and checks each one against all following maps, so that each pair is only checked once. Additionally, I applied what Holger suggested for comparing two maps.
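For reference, the "naive" baseline might look roughly like the sketch below. The benchmark code itself was not posted, so the names `NaivePairwise`, `shareKey` and `countOverlappingPairs` are hypothetical, and the pair check here is a plain `containsKey` lookup rather than the exact comparison used in the measurement.

```java
import java.util.List;
import java.util.Map;

public class NaivePairwise {

    /** Pair check: iterate one map's keys and look each up in the other map. */
    static boolean shareKey(Map<String, ?> a, Map<String, ?> b) {
        for (String key : a.keySet()) {
            if (b.containsKey(key)) {
                return true;
            }
        }
        return false;
    }

    /** Checks every map against all following maps, so each pair is visited once. */
    static int countOverlappingPairs(List<? extends Map<String, ?>> maps) {
        int count = 0;
        for (int i = 0; i < maps.size(); i++) {
            for (int j = i + 1; j < maps.size(); j++) {
                if (shareKey(maps.get(i), maps.get(j))) {
                    count++; // this is where someFunction(map1, map2) would be called
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<Map<String, Integer>> maps = List.of(
                Map.of("a", 1, "b", 2),
                Map.of("c", 3),
                Map.of("b", 4, "d", 5));
        System.out.println(countOverlappingPairs(maps));
    }
}
```

The nested loop visits n * (n - 1) / 2 pairs, which is what makes this variant scale quadratically with the number of maps.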

The "map" approach is what I posted here.

Results on my machine for 1000 maps, each having 100 random String keys of length 10:

naive: 11656 ms
map:     235 ms

Update 2: Some more results with different sizes:

1000 maps with 100 keys each and varying key length (the longer the keys, the fewer collisions)

key length   1        2         3         4         5        10        20
naive      417 ms  3221 ms  10937 ms  11273 ms  11357 ms  11383 ms  11706 ms
map         16 ms    43 ms     86 ms    224 ms    245 ms    210 ms    154 ms

1000 maps with a varying number of keys each and key length 10 (the more keys, the more collisions)

key count    50       100       500
naive      4865 ms  11368 ms  81280 ms 
map          64 ms    206 ms    913 ms

Varying number of maps with 1000 keys each and key length 10 (the more maps, the more collisions)

map count    500     1000      2000
naive      6323 ms  12766 ms  47798 ms 
map         139 ms    206 ms    333 ms

As you can see, the number of maps has the most influence on this, followed by the number of keys.

Answer 2

You didn't say anything about the ordering, but I assume that all TreeMaps have the same ordering. In this case you can reduce the outer iteration range by using the bounds of the second map. Your inner iteration is entirely unnecessary, as you can simply ask the map whether it contains the key.

for(String s: map1.navigableKeySet().subSet(map2.firstKey(), true, map2.lastKey(), true)) {
    if(map2.containsKey(s)) {
        someFunction(map1, map2);
        break;
    }
}

Explanation:

Suppose you have the following map keys:

map2:    D, E, F, G, H
         |           |
       first        last
map1: A,    E,    G,   I
            |<--->|
          subset("D", true, "H", true)

Here, map2's first element is "D" and its last element is "H". When passing these elements as inclusive bounds to map1's navigableKeySet().subSet(…) method, we get the closest inner set ["E", "G"] as the search range, hence we have precluded "A" and "I" before we even start our linear search (keep in mind that these are only example placeholders; they might stand for a large number of keys).
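The example keys from the diagram can be run directly. The following is a small sketch, assuming natural String ordering in both maps; the class name `SubSetRangeDemo` and the helper `searchRange` are made up for illustration. The empty-map guard is an addition, since `firstKey()` and `lastKey()` throw `NoSuchElementException` on an empty TreeMap.

```java
import java.util.Collections;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeMap;

public class SubSetRangeDemo {

    /** Keys of map1 that lie within [map2.firstKey(), map2.lastKey()]. */
    static NavigableSet<String> searchRange(TreeMap<String, ?> map1, TreeMap<String, ?> map2) {
        if (map1.isEmpty() || map2.isEmpty()) {
            // nothing to compare; also avoids NoSuchElementException from firstKey()
            return Collections.emptyNavigableSet();
        }
        return map1.navigableKeySet().subSet(map2.firstKey(), true, map2.lastKey(), true);
    }

    public static void main(String[] args) {
        TreeMap<String, Integer> map1 = new TreeMap<>();
        for (String key : List.of("A", "E", "G", "I")) {
            map1.put(key, 0);
        }
        TreeMap<String, Integer> map2 = new TreeMap<>();
        for (String key : List.of("D", "E", "F", "G", "H")) {
            map2.put(key, 0);
        }
        // "A" and "I" fall outside map2's bounds and are never visited
        System.out.println(searchRange(map1, map2)); // [E, G]
    }
}
```

Only the keys returned by `searchRange` still need a `containsKey` lookup in map2.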

Thinking about it further, you can skip arbitrary ranges in both maps when comparing:

public static boolean haveCommonKeys(TreeMap<String,?> map1, TreeMap<String,?> map2) {
    if(map1.isEmpty()) return false;
    for(String s=map1.firstKey(); s!=null; ) {
        String s2=map2.ceilingKey(s);
        if(s2==null) break;
        if(s2.equals(s)) return true;
        s=map1.ceilingKey(s2);
        if(s2.equals(s)) return true;
    }
    return false;
}

In this solution, we start with the first (smallest) key of one map and then alternately ask each map for a key that is equal to or greater than the key we just found in the other map. This way we skip every run of consecutive keys in one map for which the other map contains no key in between.
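One way to gain confidence in the skipping logic is to compare it against `Collections.disjoint` on random inputs. The sketch below copies `haveCommonKeys` from above unchanged; the surrounding class `LeapfrogCheck` and the helpers `randomMap` and `agreesWithDisjoint` are hypothetical test scaffolding, not part of the answer.

```java
import java.util.Collections;
import java.util.Random;
import java.util.TreeMap;

public class LeapfrogCheck {

    // Copied from the answer above.
    public static boolean haveCommonKeys(TreeMap<String, ?> map1, TreeMap<String, ?> map2) {
        if (map1.isEmpty()) return false;
        for (String s = map1.firstKey(); s != null; ) {
            String s2 = map2.ceilingKey(s);   // smallest key in map2 that is >= s
            if (s2 == null) break;            // map2 has nothing left at or above s
            if (s2.equals(s)) return true;
            s = map1.ceilingKey(s2);          // jump map1 forward to s2 or beyond
            if (s2.equals(s)) return true;
        }
        return false;
    }

    /** Builds a map with up to `size` single-letter keys (duplicates collapse). */
    static TreeMap<String, Integer> randomMap(Random rnd, int size) {
        TreeMap<String, Integer> map = new TreeMap<>();
        for (int i = 0; i < size; i++) {
            map.put(String.valueOf((char) ('a' + rnd.nextInt(26))), i);
        }
        return map;
    }

    /** True if haveCommonKeys matches !Collections.disjoint on all random rounds. */
    static boolean agreesWithDisjoint(long seed, int rounds) {
        Random rnd = new Random(seed);
        for (int i = 0; i < rounds; i++) {
            TreeMap<String, Integer> a = randomMap(rnd, rnd.nextInt(6));
            TreeMap<String, Integer> b = randomMap(rnd, rnd.nextInt(6));
            boolean expected = !Collections.disjoint(a.keySet(), b.keySet());
            if (haveCommonKeys(a, b) != expected) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(agreesWithDisjoint(42L, 10_000));
    }
}
```

The small alphabet and small map sizes are chosen so that both overlapping and disjoint pairs, as well as empty maps, occur frequently.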

Answer 3

Create your own map that associates every key with a set of your objects. Calling the getter for a key returns that set, and calling size() on it tells you whether more than one object is mapped to the key. You shouldn't put all data into one map, though, because that will slow things down considerably. It is better to partition your keys if you can, e.g. all keys made of digits in one map, all keys made of letters in a second map and the rest in a third. Then you can inspect a key, pick the map it belongs to and work with that. Like this:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class MyMap {

 private final Map<String, Set<Object>> stuff;

 public MyMap() {
  stuff = new HashMap<String, Set<Object>>(); // Or any other map instead of HashMap
 }

 public void put(final String pKey, final Object pObject) {
  Set<Object> objects = stuff.get(pKey);
  if (objects == null) {
   // first object for this key: create its set and register it
   objects = new HashSet<Object>();
   stuff.put(pKey, objects);
  }
  objects.add(pObject);
 }

 public Set<Object> get(String pKey) {
  return stuff.get(pKey); // null if nothing was stored for this key
 }

 public void remove(String pKey) {
  stuff.remove(pKey);
 }

}

But be careful, this can really hurt your performance if you have that many maps. You have to split the keys up to make it faster :) You can also use any other map/set. I used HashSet because I think you don't want to add the same object twice to the same key if you want to do checks like the ones you described.

Hope I could help :)
