
Huge performance difference between Vector and HashSet

I have a program which fetches records from a database (using Hibernate) and fills them into a Vector. There was an issue with the performance of the operation, so I ran a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - from 45 minutes down to 2 minutes!

So my question is, what is causing this huge difference? Is it just that all methods in Vector are synchronized, or that Vector internally uses an array whereas HashSet does not? Or something else?

The code is running in a single thread.

EDIT: The code is only inserting the values into the Vector (and in the other case, the HashSet).

If it's trying to use the Vector as a set, checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.

If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.

If you can test it with different numbers of records, you can see how each version grows as n increases - that would make it easier to work out what's going on.
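A minimal sketch of that scaling experiment, assuming the load reduces to one of three patterns: appending to a Vector, inserting at index 0, or adding to a HashSet. The class name GrowthCheck, the helper names and the sizes are illustrative and do not come from the question. If the time roughly quadruples each time n doubles, the pattern is quadratic.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

class GrowthCheck {
  // Append n distinct strings at the end: amortized O(1) per add.
  static int fillAppend(int n) {
    Vector<String> vector = new Vector<>();
    for (int i = 0; i < n; i++) {
      vector.add("value " + i);
    }
    return vector.size();
  }

  // Insert at index 0: every insert shifts all existing elements, O(n^2) overall.
  static int fillInsertFront(int n) {
    Vector<String> vector = new Vector<>();
    for (int i = 0; i < n; i++) {
      vector.add(0, "value " + i);
    }
    return vector.size();
  }

  // Add n distinct strings to a HashSet: expected O(1) per add.
  static int fillSet(int n) {
    Set<String> set = new HashSet<>();
    for (int i = 0; i < n; i++) {
      set.add("value " + i);
    }
    return set.size();
  }

  // Rough wall-clock timing; good enough to see a growth trend, not a precise benchmark.
  static long timeMillis(Runnable task) {
    long start = System.nanoTime();
    task.run();
    return (System.nanoTime() - start) / 1_000_000;
  }

  public static void main(String[] args) {
    for (int n = 10_000; n <= 80_000; n *= 2) {
      final int size = n;
      long append = timeMillis(() -> fillAppend(size));
      long front = timeMillis(() -> fillInsertFront(size));
      long set = timeMillis(() -> fillSet(size));
      System.out.println("n=" + size + " append=" + append + "ms insertFront="
          + front + "ms hashSet=" + set + "ms");
    }
  }
}
```

The absolute numbers depend on the machine and JIT warm-up, so only the trend across rows is meaningful.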

EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your test runs. Here's a little test application:

import java.util.*;

public class Test {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 300000; i++) {
      vector.add("dummy value");
    }
    long end = System.currentTimeMillis();
    System.out.println("Time taken: " + (end - start) + "ms");
  }
}

Output:

Time taken: 38ms

Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, if you really are just calling Vector.add(item), you should look elsewhere for the problem.

Now, changing the code above to use

vector.add(0, "dummy value"); // Insert item at the beginning

makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse, but it's still a long way from 45 minutes, and I doubt that my desktop is 60 times as fast as yours.

If you are inserting them in the middle or at the beginning instead of at the end, then the Vector needs to shift all the following elements along. On every insert. The HashSet, on the other hand, never has to shift existing elements.

Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depending on how you use the list) and you will see the difference (synchronized vs unsynchronized). Why are you using Vector in a single-threaded application at all?

import java.util.*;

public class Test {
  public static void main(String[] args) {
    long start = System.currentTimeMillis();
    Vector<String> vector = new Vector<String>();
    for (int i = 0; i < 300000; i++) {
      String value = "value " + i;
      // contains() scans the whole vector, so this loop is O(n^2) overall
      if (!vector.contains(value)) {
        vector.add(value);
      }
    }
    long end = System.currentTimeMillis();
    System.out.println("Time taken: " + (end - start) + "ms");
  }
}

If you check for a duplicate before inserting each element into the vector, every check scans the vector, so the time taken grows with the size of the vector. The better approach is to use a HashSet: it does not allow duplicates, so there is no need to check for one before inserting.

Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.

I don't know if there are reads in your test, but reads are O(1) for both Vector and HashSet, provided Vector entries are accessed by index with get().

Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.

However, I think there is a possible explanation of what might be going on.

First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashCode methods of your record class.
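A minimal sketch of that first condition, assuming a hypothetical record class Row whose equals/hashCode compare only an id field (the class, its fields and the 1,000 distinct ids are illustrative, not taken from the question). The Vector keeps every instance, while the HashSet keeps one per distinct id.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

class DuplicateDemo {
  // Hypothetical record class: two rows are "equal" when their ids match.
  static final class Row {
    final long id;
    final String payload;

    Row(long id, String payload) {
      this.id = id;
      this.payload = payload;
    }

    @Override
    public boolean equals(Object o) {
      if (this == o) return true;
      if (!(o instanceof Row)) return false;
      return id == ((Row) o).id;
    }

    @Override
    public int hashCode() {
      return Long.hashCode(id);
    }
  }

  // Returns {vector size, set size} after adding `total` rows spread over `distinctIds` ids.
  static int[] fill(int total, int distinctIds) {
    Vector<Row> vector = new Vector<>();
    Set<Row> set = new HashSet<>();
    for (int i = 0; i < total; i++) {
      Row row = new Row(i % distinctIds, "payload " + i);
      vector.add(row); // always stored
      set.add(row);    // ignored (returns false) when an equal row is already present
    }
    return new int[] {vector.size(), set.size()};
  }

  public static void main(String[] args) {
    int[] sizes = fill(300_000, 1_000);
    System.out.println("Vector holds " + sizes[0] + " rows, HashSet holds " + sizes[1]);
    // prints "Vector holds 300000 rows, HashSet holds 1000"
  }
}
```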

Next, I think you must be pushing very close to filling up the heap.

So the reason that the HashSet solution is so much faster is that most of the records are being discarded as duplicates by the set.add operation. By contrast, the Vector solution is keeping all of the records, and the JVM is spending most of its time trying to squeeze out that last 0.05% of memory by running the GC over and over and over.

One way to test this theory is to run the Vector version of the application with a much bigger heap.
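The maximum heap is normally raised with the JVM's -Xmx option (for example, java -Xmx2g followed by your main class). As a rough sketch, a hypothetical helper like the one below could be called before and after loading the records to see how close the process is to its limit; the class and method names are illustrative.

```java
class HeapReport {
  // Upper bound the JVM will try to use for the heap (controlled by -Xmx).
  static long maxHeapBytes() {
    return Runtime.getRuntime().maxMemory();
  }

  // Heap currently in use: memory claimed from the OS minus the free part of it.
  static long usedHeapBytes() {
    Runtime rt = Runtime.getRuntime();
    return rt.totalMemory() - rt.freeMemory();
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    System.out.println("Max heap: " + maxHeapBytes() / mb + " MB, used: "
        + usedHeapBytes() / mb + " MB");
  }
}
```

If the used figure sits just under the maximum while the Vector version crawls, that supports the GC-thrashing theory.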


Regardless, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.

According to Dr Heinz Kabutz, in one of his newsletters:

The old Vector class implements serialization in a naive way. They simply do the default serialization, which writes the entire Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.

import java.util.*;
import java.io.*;

public class VectorWritingSize {
  public static void main(String[] args) throws IOException {
    test(new LinkedList<String>());
    test(new ArrayList<String>());
    test(new Vector<String>());
  }

  public static void test(List<String> list) throws IOException {
    insertJunk(list);
    for (int i = 0; i < 10; i++) {
      list.add("hello world");
    }
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    ObjectOutputStream out = new ObjectOutputStream(baos);
    out.writeObject(list);
    out.close();
    System.out.println(list.getClass().getSimpleName() +
        " used " + baos.toByteArray().length + " bytes");
  }

  private static void insertJunk(List<String> list) {
    for(int i = 0; i<1000 * 1000; i++) {
      list.add("junk");
    }
    list.clear();
  }
}

When we run this code, we get the following output:

LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes

Vector can use a staggering number of bytes when being serialized. The lesson here? Don't ever use Vector as the List implementation in objects that are Serializable. The potential for disaster is too great.
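If a Vector cannot be avoided, one possible mitigation (not from the newsletter, just a sketch) is to call Vector.trimToSize() before serializing, which shrinks the backing array to the current element count. The class and helper names below are illustrative.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Vector;

class VectorTrimDemo {
  // Serializes the object into memory and returns the number of bytes written.
  static int serializedSize(Object o) {
    ByteArrayOutputStream baos = new ByteArrayOutputStream();
    try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
      out.writeObject(o);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return baos.size();
  }

  // Grows the backing array with a million elements, clears it, then adds ten.
  static Vector<String> grownThenCleared() {
    Vector<String> vector = new Vector<>();
    for (int i = 0; i < 1_000_000; i++) {
      vector.add("junk");
    }
    vector.clear(); // removes the elements but keeps the large backing array
    for (int i = 0; i < 10; i++) {
      vector.add("hello world");
    }
    return vector;
  }

  public static void main(String[] args) {
    Vector<String> vector = grownThenCleared();
    int before = serializedSize(vector);
    vector.trimToSize(); // capacity now equals the element count
    int after = serializedSize(vector);
    System.out.println("Before trim: " + before + " bytes, after trim: " + after + " bytes");
  }
}
```

Switching the field to an ArrayList, as the newsletter's numbers suggest, avoids the problem without any extra call.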

