Huge performance difference between Vector and HashSet
I have a program which fetches records from a database (using Hibernate) and fills them into a Vector. There was a performance issue with the operation, so I ran a test with the Vector replaced by a HashSet. With 300000 records, the speed gain is immense - from 45 mins down to 2 mins!
So my question is, what is causing this huge difference? Is it just that all methods in Vector are synchronized, or that Vector internally uses an array whereas HashSet does not? Or something else?
The code is running in a single thread.
EDIT: The code is only inserting the values in the Vector (and in the other case, the HashSet).
If it's trying to use the Vector as a set, and checking for the existence of a record before adding it, then filling the vector becomes an O(n^2) operation, compared with O(n) for HashSet. It would also become an O(n^2) operation if you insert each element at the start of the vector instead of at the end.
If you're just using collection.add(item) then I wouldn't expect to see that sort of difference - synchronization isn't that slow.
If you can test it with different numbers of records, you can see how each version grows as n increases - that would make it easier to work out what's going on.
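As a rough sketch of that kind of scaling test, the class below fills a collection for doubling values of n and prints the elapsed time. It assumes the suspected "check before add" pattern, which the question does not confirm; the class name GrowthCheck, the method timeFill, and the "record-" values are made up for illustration.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.Vector;

public class GrowthCheck {
    // Fills the collection with n distinct strings, checking for existence
    // before each add, and returns the elapsed time in nanoseconds.
    static long timeFill(Collection<String> target, int n) {
        long start = System.nanoTime();
        for (int i = 0; i < n; i++) {
            String value = "record-" + i; // hypothetical stand-in for a database record
            if (!target.contains(value)) {
                target.add(value);
            }
        }
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        // Doubling n each round makes the growth pattern easy to read off.
        for (int n = 5_000; n <= 40_000; n *= 2) {
            long vectorNanos = timeFill(new Vector<String>(), n);
            long setNanos = timeFill(new HashSet<String>(), n);
            System.out.printf("n=%d vector=%dms hashSet=%dms%n",
                    n, vectorNanos / 1_000_000, setNanos / 1_000_000);
        }
    }
}
```

If the Vector time roughly quadruples each time n doubles while the HashSet time roughly doubles, that points at quadratic behaviour rather than at synchronization. The timings are only indicative, since there is no JIT warm-up here.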
EDIT: If you're just using Vector.add then it sounds like something else could be going on - e.g. your database was behaving differently between your test runs. Here's a little test application:
import java.util.*;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            vector.add("dummy value");
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
Output:
Time taken: 38ms
Now obviously this isn't going to be very accurate - System.currentTimeMillis isn't the best way of getting accurate timing - but it's clearly not taking 45 minutes. In other words, you should look elsewhere for the problem, if you really are just calling Vector.add(item).
Now, changing the code above to use
vector.add(0, "dummy value"); // Insert item at the beginning
makes an enormous difference - it takes 42 seconds instead of 38ms. That's clearly a lot worse - but it's still a long way from being 45 minutes - and I doubt that my desktop is 60 times as fast as yours.
If you are inserting them at the middle or beginning instead of at the end, then the Vector needs to shift all the following elements along. On every insert. The HashSet, on the other hand, has no element order to maintain, so it has nothing to shift.
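To put a number on that shifting, here is a small sketch that counts the element moves for n inserts at index 0 of an array-backed list, on the assumption that each insert moves every element already stored. The class and method names are made up for illustration.

```java
public class FrontInsertCost {
    // Number of element moves needed to insert n items one by one at index 0
    // of an array-backed list: the i-th insert (counting from 0) moves the
    // i items that are already stored.
    static long shiftsForFrontInserts(int n) {
        long shifts = 0;
        for (int i = 0; i < n; i++) {
            shifts += i;
        }
        return shifts;
    }

    public static void main(String[] args) {
        // prints 44999850000
        System.out.println(shiftsForFrontInserts(300_000));
    }
}
```

For 300000 records that is about 45 billion element moves, against zero moves when appending at the end (ignoring the occasional array resize).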
Vector is outdated and should not be used anymore. Profile with ArrayList or LinkedList (depending on how you use the list) and you will see the difference (synchronized vs unsynchronized). Why are you using Vector in a single-threaded application at all?
import java.util.*;

public class Test {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < 300000; i++) {
            String value = "dummy value " + i;
            // contains() scans the whole vector, so this check costs O(n) per element
            if (!vector.contains(value)) {
                vector.add(value);
            }
        }
        long end = System.currentTimeMillis();
        System.out.println("Time taken: " + (end - start) + "ms");
    }
}
If you check for a duplicate before inserting each element into the vector, the time taken grows with the size of the vector. The best way to get high performance is to use a HashSet: it does not allow duplicates, so there is no need to check for one before inserting.
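A minimal sketch of that approach, assuming the records can be represented as plain strings (the class name DedupWithSet and the sample values are made up). HashSet.add returns false and leaves the set unchanged when an equal element is already present, so no separate contains() call is needed.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupWithSet {
    // Collects the distinct values; duplicates are rejected by add() itself.
    static Set<String> distinct(List<String> records) {
        Set<String> seen = new HashSet<String>();
        for (String record : records) {
            seen.add(record); // returns false for a duplicate, no pre-check required
        }
        return seen;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "a", "c", "b");
        // prints 3
        System.out.println(distinct(records).size());
    }
}
```

Note that a HashSet does not preserve insertion order; if the order matters, LinkedHashSet behaves the same way for duplicates while keeping the order.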
Vector is synchronized by default; HashSet is not. That's my guess. Obtaining a monitor for access takes time.
I don't know if there are reads in your test, but Vector and HashSet are both O(1) for reads, provided get() is used to access the Vector entries by index.
Under normal circumstances, it is totally implausible that inserting 300,000 records into a Vector will take 43 minutes longer than inserting the same records into a HashSet.
However, I think there is a possible explanation of what might be going on.
First, the records coming out of the database must have a very high proportion of duplicates. Or at least, they must be duplicates according to the semantics of the equals/hashCode methods of your record class.
Next, I think you must be pushing very close to filling up the heap.
So the reason that the HashSet solution is so much faster is that most of the records are being discarded as duplicates by the set.add operation. By contrast, the Vector solution is keeping all of the records, and the JVM is spending most of its time trying to squeeze out that last 0.05% of memory by running the GC over and over and over.
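To illustrate the theory, here is a sketch with a hypothetical record class Row whose equals/hashCode only look at an id. The class names, the payload field, and the choice of 100 distinct ids are all assumptions made for the example, not something taken from the question.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.Vector;

public class DuplicateRecords {
    // Hypothetical record class: two rows are equal when their ids match.
    static final class Row {
        final int id;
        final String payload;

        Row(int id, String payload) {
            this.id = id;
            this.payload = payload;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) return true;
            if (!(o instanceof Row)) return false;
            return id == ((Row) o).id;
        }

        @Override
        public int hashCode() {
            return Integer.hashCode(id);
        }
    }

    // Adds `total` rows that share only `distinctIds` different ids to both
    // collections and returns {vector size, set size}.
    static int[] fillBoth(int total, int distinctIds) {
        Vector<Row> vector = new Vector<Row>();
        Set<Row> set = new HashSet<Row>();
        for (int i = 0; i < total; i++) {
            Row row = new Row(i % distinctIds, "payload " + i);
            vector.add(row); // always retained
            set.add(row);    // dropped when an equal row is already in the set
        }
        return new int[] { vector.size(), set.size() };
    }

    public static void main(String[] args) {
        int[] sizes = fillBoth(300_000, 100);
        // prints vector=300000 set=100
        System.out.println("vector=" + sizes[0] + " set=" + sizes[1]);
    }
}
```

In this scenario the Vector keeps all 300000 rows reachable while the HashSet only holds 100, so the rest become garbage immediately. That is the difference in memory pressure the theory relies on.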
One way to test this theory is to run the Vector version of the application with a much bigger heap.
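As a sketch of how one might check the heap headroom from inside the application, the snippet below reports the maximum heap and the fraction currently in use through java.lang.Runtime. The class name HeapHeadroom and the method usedFraction are made up; where to call it in the real program is up to you.

```java
public class HeapHeadroom {
    // Fraction of the maximum heap currently in use, between 0.0 and 1.0.
    static double usedFraction() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return (double) used / rt.maxMemory();
    }

    public static void main(String[] args) {
        long maxMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
        System.out.printf("max heap: %d MB, in use: %.1f%%%n", maxMb, usedFraction() * 100);
    }
}
```

The maximum heap can be raised with the standard -Xmx option, for example java -Xmx2g HeapHeadroom (the 2g value is only an example). If the Vector version speeds up dramatically with a larger heap, that supports the GC explanation.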
Regardless, the best way to investigate this kind of problem is to run the application using a profiler, and see where all the CPU time is going.
Dr Heinz Kabutz said this in one of his newsletters:
The old Vector class implements serialization in a naive way. They simply do the default serialization, which writes the entire Object[] as-is into the stream. Thus if we insert a bunch of elements into the List, then clear it, the difference between Vector and ArrayList is enormous.
import java.util.*;
import java.io.*;

public class VectorWritingSize {
    public static void main(String[] args) throws IOException {
        test(new LinkedList<String>());
        test(new ArrayList<String>());
        test(new Vector<String>());
    }

    public static void test(List<String> list) throws IOException {
        insertJunk(list);
        for (int i = 0; i < 10; i++) {
            list.add("hello world");
        }
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        ObjectOutputStream out = new ObjectOutputStream(baos);
        out.writeObject(list);
        out.close();
        System.out.println(list.getClass().getSimpleName() +
                " used " + baos.toByteArray().length + " bytes");
    }

    private static void insertJunk(List<String> list) {
        for (int i = 0; i < 1000 * 1000; i++) {
            list.add("junk");
        }
        list.clear();
    }
}
When we run this code, we get the following output:
LinkedList used 107 bytes
ArrayList used 117 bytes
Vector used 1310926 bytes
Vector can use a staggering amount of bytes when being serialized. The lesson here? Don't ever use Vector as Lists in objects that are Serializable. The potential for disaster is too great.
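If you are stuck with a Vector in a Serializable object, one possible mitigation (my own suggestion, not something from the newsletter) is to call Vector.trimToSize() before writing, which shrinks the backing array to the current element count. The sketch below measures the serialized size before and after; the class and helper names are made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Vector;

public class VectorTrimBeforeWrite {
    // Serializes the object into memory and returns the number of bytes written.
    static int serializedSize(Object value) {
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(baos)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return baos.size();
    }

    // Builds a Vector whose backing array has grown large but which only
    // holds 10 elements, mirroring insertJunk() in the test above.
    static Vector<String> grownThenCleared(int junkCount) {
        Vector<String> vector = new Vector<String>();
        for (int i = 0; i < junkCount; i++) {
            vector.add("junk");
        }
        vector.clear();
        for (int i = 0; i < 10; i++) {
            vector.add("hello world");
        }
        return vector;
    }

    public static void main(String[] args) {
        Vector<String> vector = grownThenCleared(1000 * 1000);
        int before = serializedSize(vector);
        vector.trimToSize(); // shrink the backing array to the element count
        int after = serializedSize(vector);
        System.out.println("before trim: " + before + " bytes, after trim: " + after + " bytes");
    }
}
```

Switching to ArrayList avoids the problem without the extra call, so this is only worth considering where the Vector cannot be replaced.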