Java Collection performance question

Question

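I wrote a method that takes two Collection<String> arguments and copies one into the other.

However, I'm not sure whether I should check that the collections contain the same elements before I start copying, or whether I should just copy regardless. Here is the method: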

 /**
  * Copies from one collection to the other. Does not allow empty strings.
  * Removes duplicates.
  * Clears the dest collection first.
  * @param src  the collection to copy from
  * @param dest the collection to copy into
  */
 public static void copyStringCollectionAndRemoveDuplicates(Collection<String> src, Collection<String> dest) {
  if(src == null || dest == null)
   return;

  //Is this faster to do? Or should I just comment this block out
  if(src.containsAll(dest))
   return;

  dest.clear();
  Set<String> uniqueSet = new LinkedHashSet<String>(src.size());
  for(String f : src) 
   if(!"".equals(f)) 
    uniqueSet.add(f);

  dest.addAll(uniqueSet);
 }

Maybe it would be faster to remove this:

if(src.containsAll(dest))
    return;

since the method iterates over the whole collection anyway.

Answer 1

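I'd say: remove it! It's duplicated 'code': the Set performs the same contains() checks anyway, so there is no need to pre-process here. Unless you have a huge input collection and a brilliant O(1) test for containsAll() ;-)

The Set is fast enough. Its cost is O(n) in the size of the input (one contains() and possibly one add() per String), and if the target.containsAll() test fails, contains() is done twice for every String, which means lower performance.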

Edit

Some code to visualize my answer:

static void copy(Collection<String> source, Collection<String> dest) {
  boolean containsAll = true;
  for (String s : source) {        // iteration 1
    if (!dest.contains(s)) {       // contains() test
      containsAll = false;
      break;
    }
  }
  if (!containsAll) {
    for (String s : source) {      // iteration 2
      if (!dest.contains(s)) {     // contains() test
        dest.add(s);
      }
    }
  }
}

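If all source elements are already in dest, contains() is called once per source element. If all but the last source element are in dest (the worst case), contains() is called 2n times (n = size of the source collection): n times in the first pass, which only fails on the last element, and n times again in the second pass. Either way, the total number of contains() tests with the additional check is always equal to or greater than for the same code without it.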
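The counting argument above can be checked with a small experiment. This is only a sketch: CountingList and ContainsCountDemo are hypothetical names introduced here, and the list simply counts how often contains() is called on it in the near-miss case where only the last source element is missing from dest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class ContainsCountDemo {
    // Variant with the extra pass: check everything first, then add what is missing.
    static void copyWithPrecheck(Collection<String> source, Collection<String> dest) {
        boolean containsAll = true;
        for (String s : source) {
            if (!dest.contains(s)) {
                containsAll = false;
                break;
            }
        }
        if (!containsAll) {
            for (String s : source) {
                if (!dest.contains(s)) {
                    dest.add(s);
                }
            }
        }
    }

    // Variant without the extra pass.
    static void copyWithoutPrecheck(Collection<String> source, Collection<String> dest) {
        for (String s : source) {
            if (!dest.contains(s)) {
                dest.add(s);
            }
        }
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("a", "b", "c"); // n = 3, only "c" is missing
        CountingList withCheck = new CountingList(Arrays.asList("a", "b"));
        CountingList withoutCheck = new CountingList(Arrays.asList("a", "b"));

        copyWithPrecheck(source, withCheck);
        copyWithoutPrecheck(source, withoutCheck);

        System.out.println("with pre-check:    " + withCheck.containsCalls);    // prints 6
        System.out.println("without pre-check: " + withoutCheck.containsCalls); // prints 3
    }
}

// Hypothetical helper for illustration: an ArrayList that counts contains() calls.
class CountingList extends ArrayList<String> {
    private static final long serialVersionUID = 1L;
    int containsCalls = 0;

    CountingList(Collection<String> initial) {
        super(initial);
    }

    @Override
    public boolean contains(Object o) {
        containsCalls++;
        return super.contains(o);
    }
}
```

Both variants leave dest with the same contents; the pre-check only adds calls.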

Edit 2

Let's assume we have the following collections:

source = {"", "a", "b", "c", "c"}
dest = {"a", "b"}

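First, the containsAll test (read as "does dest already contain everything in source?") fails, because the empty String in source is not in dest (this is a small design flaw in your code ;)). Then you create a temporary set, which will be {"a", "b", "c"} (the empty String and the second "c" are ignored). Finally you add everything to dest and, assuming dest is a simple ArrayList that has not been cleared first, the result is {"a", "b", "a", "b", "c"}. Is that the intent? A shorter alternative: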

void copy(Collection<String> in, Collection<String> out) {
  Set<String> unique = new HashSet<String>(in); // drops duplicates
  unique.remove("");                            // drop the empty string from the copy, not from the input
  out.addAll(unique);
}
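To make the worked example concrete, here is a runnable sketch of it. The class name CopyExample is made up for illustration, and LinkedHashSet is swapped in for HashSet on the assumption that the first-seen order of the source should be kept; that is a choice of this sketch, not something the answer requires.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class CopyExample {
    // Same idea as the shorter alternative, but with LinkedHashSet so that the
    // first-seen order of the input is kept (an assumption of this sketch).
    static void copy(Collection<String> in, Collection<String> out) {
        Set<String> unique = new LinkedHashSet<String>(in);
        unique.remove("");
        out.addAll(unique);
    }

    public static void main(String[] args) {
        List<String> source = Arrays.asList("", "a", "b", "c", "c");
        List<String> dest = new ArrayList<String>(Arrays.asList("a", "b"));

        copy(source, dest);

        // dest was not cleared, so its old elements are still there.
        System.out.println(dest); // prints [a, b, a, b, c]
    }
}
```

If dest should end up without duplicates, it either has to be cleared first or has to be a Set itself.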

Answer 2

If target has more elements than dest, containsAll() will not help:
target: [a, b, c, d]
dest: [a, b, c]
target.containsAll(dest) is true, so dest stays [a, b, c], but it should be [a, b, c, d].

I think the following code is more elegant:

Set<String> uniqueSet = new LinkedHashSet<String>(target.size());
uniqueSet.addAll(target);
uniqueSet.remove(""); // remove() is a no-op when "" is absent, so no contains() check is needed

dest.addAll(uniqueSet);
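The direction problem described above is easy to reproduce. The sketch below uses a hypothetical helper, sameElements, to show what a symmetric check would look like; note that it ignores duplicates and ordering, which may or may not be what is wanted.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

class ContainsAllDirection {
    // Hypothetical helper: true only if each collection contains every element
    // of the other. Duplicates and ordering are ignored.
    static boolean sameElements(Collection<String> a, Collection<String> b) {
        return a.containsAll(b) && b.containsAll(a);
    }

    public static void main(String[] args) {
        List<String> target = Arrays.asList("a", "b", "c", "d");
        List<String> dest = Arrays.asList("a", "b", "c");

        System.out.println(target.containsAll(dest));   // prints true
        System.out.println(dest.containsAll(target));   // prints false
        System.out.println(sameElements(target, dest)); // prints false
    }
}
```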

Answer 3

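If it matters, you could benchmark it. I think the call to containsAll() probably doesn't help, though it may depend on how often the two collections have the same contents.

But this code is confusing. Is it trying to add new items to dest? Then why clear it first? Just return your new uniqueSet to the caller instead of bothering. And isn't your containsAll() check reversed?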
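A minimal sketch of the "just return the set" suggestion might look like this. The method name uniqueNonEmpty is invented for the example, and LinkedHashSet is kept from the question's code so that insertion order survives.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

class UniqueStrings {
    // Returns a new set with duplicates and empty strings removed.
    // The input collection is only read, never modified.
    static Set<String> uniqueNonEmpty(Collection<String> src) {
        Set<String> result = new LinkedHashSet<String>();
        for (String s : src) {
            if (!"".equals(s)) {
                result.add(s);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Set<String> unique = uniqueNonEmpty(Arrays.asList("b", "", "a", "b"));
        System.out.println(unique); // prints [b, a]

        // The caller decides what kind of collection to put the result into.
        System.out.println(new ArrayList<String>(unique).size()); // prints 2
    }
}
```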

Answer 4

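Too many confusing parameter names. dest and target mean almost the same thing. You would be better off choosing something like dest and source. It would make things much clearer, even for you.
I have a feeling (not sure it is correct) that you are using the collections API the wrong way. The Collection interface says nothing about the uniqueness of its elements, yet you add that quality to it.
Modifying collections that are passed in as parameters is not the best idea (though, as usual, it depends). In general, mutability is harmful and unnecessary. Moreover, what if the passed collection is unmodifiable/immutable? It is better to return a new collection than to modify the incoming ones.
The Collection interface has the methods addAll, removeAll and retainAll. Did you try them first? Have you performance-tested code such as: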
```
Collection<String> result = new HashSet<String>(dest);
result.addAll(target);
```
or
```
target.removeAll(dest);
dest.addAll(target);
```
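The point about unmodifiable arguments can be illustrated with a short sketch. ReturnNewCollection and union are hypothetical names; the example only shows that clear() on an unmodifiable view throws, while building the result in a fresh set works with any input.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

class ReturnNewCollection {
    // Builds the union in a fresh set; neither argument is modified.
    static Set<String> union(Collection<String> dest, Collection<String> target) {
        Set<String> result = new HashSet<String>(dest);
        result.addAll(target);
        return result;
    }

    public static void main(String[] args) {
        Collection<String> readOnly = Collections.unmodifiableList(Arrays.asList("a", "b"));
        Collection<String> extra = Arrays.asList("b", "c");

        try {
            readOnly.clear(); // what a method that mutates its argument would attempt
        } catch (UnsupportedOperationException e) {
            System.out.println("cannot modify the caller's collection");
        }

        System.out.println(union(readOnly, extra).size()); // prints 3
    }
}
```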

Answer 5

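The code is hard to read and not very efficient. The "dest" parameter is confusing: it is passed in as a parameter, then it is cleared and the results are added to it. What is the point of it being a parameter? Why not simply return a new collection? The only benefit I can see is that the caller can determine the collection type. Is that necessary?

I think this code can be written more clearly, and probably more efficiently, as follows: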

public static Set<String> createSet(Collection<String> source) {
    Set<String> destination = new HashSet<String>(source) {
        private static final long serialVersionUID = 1L;

        @Override
        public boolean add(String o) {
            if ("".equals(o)) {
                return false;
            }
            return super.add(o);
        }
    }; 
    return destination;
}

Another approach is to create your own set type:

public class NonEmptyStringSet extends HashSet<String> {
    private static final long serialVersionUID = 1L;

    public NonEmptyStringSet() {
        super();
    }

    public NonEmptyStringSet(Collection<String> source) {
        super(source);
    }

    @Override
    public boolean add(String o) {
        if ("".equals(o)) {
            return false;
        }
        return super.add(o);
    }
}

Usage:

createSet(source);
new NonEmptyStringSet(source);

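Returning the set performs better, because you don't have to create a temporary set first and then add everything to the dest collection.

The benefit of the NonEmptyStringSet type is that you can keep adding strings and still have the empty-string check.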
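A short usage sketch of that benefit follows. It repeats the class so that it runs on its own; NonEmptyStringSetDemo is a made-up name. One assumption worth stating: the filtering during construction and addAll() works because those paths go through the overridden add(), which holds for AbstractCollection.addAll() as documented and for HashSet's copy constructor in OpenJDK, but is not spelled out in the HashSet constructor contract.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;

class NonEmptyStringSetDemo {
    public static void main(String[] args) {
        NonEmptyStringSet set = new NonEmptyStringSet(Arrays.asList("a", "", "b", "a"));
        System.out.println(set.size());             // prints 2

        System.out.println(set.add(""));            // prints false, "" is rejected
        System.out.println(set.add("c"));           // prints true
        set.addAll(Arrays.asList("", "d"));         // addAll() also goes through add()
        System.out.println(set.contains(""));       // prints false
        System.out.println(set.size());             // prints 4
    }
}

// Same class as in the answer, repeated here so the sketch is self-contained.
class NonEmptyStringSet extends HashSet<String> {
    private static final long serialVersionUID = 1L;

    NonEmptyStringSet() {
        super();
    }

    NonEmptyStringSet(Collection<String> source) {
        super(source);
    }

    @Override
    public boolean add(String o) {
        if ("".equals(o)) {
            return false;
        }
        return super.add(o);
    }
}
```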

EDIT1:

Removing "if (src.containsAll(dest)) return;" introduces a "bug" when the method is called with source == dest: the result is that source ends up empty. Example:

Collection<String> source = new ArrayList<String>();
source.add("abc");
copyStringCollectionAndRemoveDuplicates(source, source);
System.out.println(source);
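The aliasing effect can be reproduced in isolation. The sketch below is the question's method with the containsAll() check removed (renamed copyWithoutCheck here for clarity, a hypothetical name): clearing dest also clears src when both refer to the same list, so there is nothing left to copy.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Set;

class AliasingDemo {
    // The original method body without the containsAll() early return.
    static void copyWithoutCheck(Collection<String> src, Collection<String> dest) {
        dest.clear(); // when src == dest this empties the source as well
        Set<String> uniqueSet = new LinkedHashSet<String>(src.size());
        for (String f : src) {
            if (!"".equals(f)) {
                uniqueSet.add(f);
            }
        }
        dest.addAll(uniqueSet);
    }

    public static void main(String[] args) {
        Collection<String> source = new ArrayList<String>();
        source.add("abc");
        copyWithoutCheck(source, source);
        System.out.println(source); // prints []
    }
}
```

An explicit src == dest guard at the top of the method would make the behaviour deliberate rather than accidental.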

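EDIT2:

I did a small benchmark which shows that my implementation is about 30% faster than a simplified version of the initial implementation. This benchmark is the best case for the initial implementation, because the dest collection is empty, so it does not have to be cleared. Also note that my implementation uses HashSet instead of LinkedHashSet, which makes my implementation a bit faster.

Benchmark code: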

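import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

public class SimpleBenchmark {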
public static void main(String[] args) {
    Collection<String> source = Arrays.asList("abc", "def", "", "def", "", 
            "jsfldsjdlf", "jlkdsf", "dsfjljka", "sdfa", "abc", "dsljkf", "dsjfl", 
            "js52fldsjdlf", "jladsf", "dsfjdfgljka", "sdf123a", "adfgbc", "dslj452kf", "dsjfafl", 
            "js21ldsjdlf", "jlkdsvbxf", "dsfjljk342a", "sdfdsa", "abxc", "dsljkfsf", "dsjflasd4" );

    int runCount = 1000000;
    long start1 = System.currentTimeMillis();
    for (int i = 0; i < runCount; i++) {
        copyStringCollectionAndRemoveDuplicates(source, new ArrayList<String>());
    }
    long time1 = (System.currentTimeMillis() - start1);
    System.out.println("Time 1: " + time1);


    long start2 = System.currentTimeMillis();
    for (int i = 0; i < runCount; i++) {
        new NonEmptyStringSet(source);
    }
    long time2 = (System.currentTimeMillis() - start2);
    System.out.println("Time 2: " + time2);

    long difference = time1 - time2;
    double percentage = (double)time2 / (double) time1;

    System.out.println("Difference: " + difference + " percentage: " + percentage);
}

public static class NonEmptyStringSet extends HashSet<String> {
    private static final long serialVersionUID = 1L;

    public NonEmptyStringSet() {
    }

    public NonEmptyStringSet(Collection<String> source) {
        super(source);
    }

    @Override
    public boolean add(String o) {
        if ("".equals(o)) {
            return false;
        }
        return super.add(o);
    }
}

public static void copyStringCollectionAndRemoveDuplicates(
        Collection<String> src, Collection<String> dest) {
    Set<String> uniqueSet = new LinkedHashSet<String>(src.size());
    for (String f : src)
        if (!"".equals(f))
            uniqueSet.add(f);

    dest.addAll(uniqueSet);
}
}
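Hand-rolled timing loops like the one above are sensitive to JIT warm-up and to the JVM discarding results that are never used. The following is a rough sketch of a slightly more careful variant, not a replacement for a proper harness such as JMH; WarmupBenchmark, viaLinkedHashSet, viaHashSet, timeNanos and sink are names invented for this example, and the numbers it prints will vary by machine and JVM.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class WarmupBenchmark {
    // Accumulates result sizes so the work cannot be discarded as unused.
    static long sink = 0;

    static Set<String> viaLinkedHashSet(Collection<String> src) {
        Set<String> unique = new LinkedHashSet<String>(src);
        unique.remove("");
        return unique;
    }

    static Set<String> viaHashSet(Collection<String> src) {
        Set<String> unique = new HashSet<String>(src);
        unique.remove("");
        return unique;
    }

    // Runs the task the given number of times and returns the elapsed nanoseconds.
    static long timeNanos(Runnable task, int runs) {
        long start = System.nanoTime();
        for (int i = 0; i < runs; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        final List<String> source = Arrays.asList("abc", "def", "", "def", "", "sdfa", "abc");
        Runnable linked = () -> sink += viaLinkedHashSet(source).size();
        Runnable hashed = () -> sink += viaHashSet(source).size();

        // Warm-up rounds: timings are thrown away so both paths get compiled first.
        timeNanos(linked, 100_000);
        timeNanos(hashed, 100_000);

        long t1 = timeNanos(linked, 1_000_000);
        long t2 = timeNanos(hashed, 1_000_000);
        System.out.println("LinkedHashSet: " + t1 / 1_000_000 + " ms");
        System.out.println("HashSet:       " + t2 / 1_000_000 + " ms");
        System.out.println("(sink = " + sink + ")");
    }
}
```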

Answer 6

I don't really think I understand why you would want this method, but assuming that it is worthwhile, I would implement it as follows:

public static void copyStringCollectionAndRemoveDuplicates(
        Collection<String> src, Collection<String> dest) {
    if (src == dest) {
         throw new IllegalArgumentException("src == dest");
    }
    dest.clear();
    if (dest instanceof Set) {
        dest.addAll(src);
        dest.remove("");
    } else if (src instanceof Set) {
        for (String s : src) {
            if (!"".equals(s)) {
                dest.add(s);
            }
        }
    } else {
        HashSet<String> tmp = new HashSet<String>(src);
        tmp.remove("");
        dest.addAll(tmp);
    }
}

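Notes:

This does not preserve the order of the elements in the src argument in all cases, but the method signature implies that this is irrelevant.
I deliberately don't check for null. Passing null as an argument is a bug, and the correct thing to do is to allow a NullPointerException to be thrown.
Trying to copy a collection to itself is also a bug.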
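If source order does matter, one possible variation on the notes above is to use LinkedHashSet for the temporary set. This is a sketch under that assumption, with OrderPreservingCopy as a made-up class name; it keeps the src == dest check and the deliberate absence of null checks, and order is still only preserved when dest itself is ordered (for example a List).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class OrderPreservingCopy {
    static void copy(Collection<String> src, Collection<String> dest) {
        if (src == dest) {
            throw new IllegalArgumentException("src == dest");
        }
        dest.clear();
        // LinkedHashSet drops duplicates but remembers first-seen order.
        Set<String> tmp = new LinkedHashSet<String>(src);
        tmp.remove("");
        dest.addAll(tmp);
    }

    public static void main(String[] args) {
        List<String> src = Arrays.asList("c", "a", "", "c", "b");
        List<String> dest = new ArrayList<String>(Arrays.asList("old"));

        copy(src, dest);
        System.out.println(dest); // prints [c, a, b]

        try {
            copy(dest, dest);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage()); // prints rejected: src == dest
        }
    }
}
```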

6 answers

Answer 1
7 accepted 2010-05-27 08:09:39

Answer 2
3 2010-05-27 08:20:36

Answer 3
2 2010-05-27 08:10:22

Answer 4
1 2010-05-27 08:31:01

Answer 5
1 2010-05-27 09:36:25

Answer 6
0 2010-05-27 10:31:13
