Java heap space: HashMap, ArrayList

I would like to process a text file (about 400 MB) in order to build a recursive parent-child structure from the data given in each line. The data has to be prepared for top-down navigation (input: a parent, output: all children and sub-children). Example of the lines to be read (child, id1, id2, parent, id3):

132142086;1;2;132528589;132528599
132142087;1;3;132528589;132528599
132142088;1;0;132528589;132528599
323442444;1;0;132142088;132528599
454345434;1;0;323442444;132528599

132528589: is parent of 132142086, 132142087, 132142088
132142088: is parent of 323442444
323442444: is parent of 454345434

Given: OS Windows XP, 32-bit, 2 GB available memory and -Xmx1024m. Here is how I prepare the data:

HashMap<String, ArrayList<String>> hMap = new HashMap<String, ArrayList<String>>();
while ((myReader = bReader.readLine()) != null) {
    String[] tmpObj = myReader.split(delimiter);
    String valuesArrayS = tmpObj[0] + ";" + tmpObj[1] + ";" + tmpObj[2] + ";" + tmpObj[3] + ";" + tmpObj[4];
    ArrayList<String> valuesArray = new ArrayList<String>();
    // case of same key
    if (hMap.containsKey(tmpObj[3])) {
        valuesArray = (ArrayList<String>) (hMap.get(tmpObj[3])).clone();
    }
    valuesArray.add(valuesArrayS);
    hMap.put(tmpObj[3], valuesArray);
    tmpObj = null;
    valuesArray = null;
}

return hMap;

After that I use a recursive function:

HashMap<String,ArrayList<String>> getChildren(input parent)

for creating the data structure needed. The plan is to make hMap available (read only) to more than one thread via the getChildren function.
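The recursive part is not shown in the question; below is a minimal sketch of what such a top-down lookup could look like, assuming the map built above (parent id mapped to the full lines of its children, with the child id in field 0) and assuming the data forms a tree with no cycles:

// Hypothetical sketch only: returns the sub-map for one parent, i.e. the
// parent itself plus every child and sub-child found below it.
HashMap<String, ArrayList<String>> getChildren(String parent,
        HashMap<String, ArrayList<String>> hMap) {
    HashMap<String, ArrayList<String>> result = new HashMap<String, ArrayList<String>>();
    ArrayList<String> directChildren = hMap.get(parent);
    if (directChildren == null) {
        return result;                        // leaf node: nothing stored below it
    }
    result.put(parent, directChildren);
    for (String line : directChildren) {
        String childId = line.split(";")[0];  // field 0 holds the child id
        result.putAll(getChildren(childId, hMap));
    }
    return result;
}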
I tested this program with an input file of 90 MB and it seemed to work properly. However, running it with the real file of more than 380 MB leads to:
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
I need some help with memory resource management.

From the "dirt-simple approach" side of things: Based on your problem statement, you don't need to keep id1, id2, or id3 around. 从“污点简单方法”的角度来看:根据问题陈述,您无需保留id1,id2或id3。 Assuming that's the case, how about replacing your HashMap<String, ArrayList<String>> with a HashMap<Integer, ArrayList<Integer>> ? 假设是这种情况,如何用HashMap<Integer, ArrayList<Integer>>替换HashMap<String, ArrayList<String>> HashMap<Integer, ArrayList<Integer>> You can use Integer.parseInt() to do the string-to-int conversion, and an Integer should always be smaller than the corresponding String. 您可以使用Integer.parseInt()进行字符串到整数的转换,并且Integer始终应小于相应的String。

Other suggestions: replace your ArrayList with a HashSet if you don't care about duplicates.

Per outofBounds' answer, you don't need to clone an ArrayList every time you want to add an item to it.

Do check out increasing your memory, as suggested by others. Also, you can store your data within the table more compactly, as suggested by Sbodd and others.

However, you may be running afoul of memory fragmentation. Hash maps use arrays. Big hash maps use big arrays. You are not specifying the size of your hash map, so every time it decides it needs to be bigger, it discards its old array and allocates a new one. After a while, your memory fills up with discarded hash table arrays and you get an OutOfMemoryError even though you technically have plenty of free memory. (90% of your memory could be available, but in pieces too small to use.)

The garbage collector (GC) works continuously to combine all these free bits into blocks big enough to use. If your program ran slowly enough, you would not have a problem, but your program is running full tilt and the GC is going to fall behind. The GC will throw the error if it cannot assemble a free block big enough, fast enough; the mere fact that the memory exists will not stop it. (This means that a program that could run won't, but it keeps the JVM from running really slow and looking really bad to users.)

Given that you know how big your hash map has to be, I'd set the size up front. Even if the size isn't precisely right, it may solve your memory problem without increasing the heap size, and it will definitely make your program run faster (or as fast as your file reads let it; use big file buffers).

If you have no real idea how big your table might be, use a TreeMap. It's a bit slower but does not allocate huge arrays and is hence a lot kinder to the GC. I find them a lot more flexible and useful. You might even look at ConcurrentSkipListMap, which is slower than TreeMap but lets you add, read, and delete from multiple threads simultaneously.
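Since all of these implement the Map interface, the change is confined to the construction site; a sketch, keeping the value type from the question (TreeMap lives in java.util, ConcurrentSkipListMap in java.util.concurrent):

// Sorted, array-free alternative to HashMap:
Map<String, ArrayList<String>> sortedMap = new TreeMap<String, ArrayList<String>>();
// Thread-safe variant that also allows concurrent writes:
Map<String, ArrayList<String>> concurrentMap = new ConcurrentSkipListMap<String, ArrayList<String>>();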

But your best bet is something like:

hMap = new HashMap<String,ArrayList<String>>( 10000000 );
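The "big file buffers" point from above can be handled when the reader is created; a sketch with an illustrative file name and buffer size (classes from java.io):

// 1 MB read buffer instead of BufferedReader's 8 KB default.
BufferedReader bReader = new BufferedReader(new FileReader("input.txt"), 1024 * 1024);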

Inside your while loop you can reduce some space, something like this:

String[] tmpObj = myReader.split(delimiter);
// String = String + String takes more space than String.format(...)
//String valuesArrayS = tmpObj[0]+";"+tmpObj[1]+";"+tmpObj[2]+";"+tmpObj[3]+";"+tmpObj[4];

// Just add an empty list if there is no list for this key yet
if (!hMap.containsKey(tmpObj[3])) {
    hMap.put(tmpObj[3], new ArrayList<String>());
}
// Get the list from the map and add the new entry
List<String> values = hMap.get(tmpObj[3]);
values.add(String.format("%s;%s;%s;%s;%s", tmpObj[0], tmpObj[1], tmpObj[2], tmpObj[3], tmpObj[4]));

No need to clone the list.

You are really testing the boundaries of what one can do with 1 GB of memory.

You could:

  1. Increase heap space. 32-bit Windows will limit you to ~1.5 GB, but you still have a little more wiggle room; it might be enough to put you over the top.
  2. Build some kind of pre-processor utility that pre-partitions the file into sizes you know will work and operates on them one at a time, perhaps hierarchically (a rough sketch follows this list).
  3. Try re-structuring your program. It does a lot of splitting and concatenating. In Java, strings are immutable, and when you split strings and concatenate them with the + operator you are creating new Strings all the time (in 9 out of 10 cases this doesn't matter, but in your case, where you are working with a very limited set of resources, it might make a difference).
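For point 2, a minimal sketch of the pre-partitioning step, assuming the ";"-delimited format from the question; the file names, the partition count, and bucketing by the parent id's hash are all illustrative (classes from java.io):

// Split the big file into smaller bucket files keyed by parent id (field 3),
// so that each bucket can later be loaded and processed on its own.
int partitions = 16;                                        // illustrative
BufferedWriter[] out = new BufferedWriter[partitions];
for (int i = 0; i < partitions; i++) {
    out[i] = new BufferedWriter(new FileWriter("part-" + i + ".txt"));
}
BufferedReader in = new BufferedReader(new FileReader("input.txt"), 1024 * 1024);
String line;
while ((line = in.readLine()) != null) {
    String parent = line.split(";")[3].trim();
    int bucket = (parent.hashCode() & 0x7fffffff) % partitions;  // non-negative bucket index
    out[bucket].write(line);
    out[bucket].newLine();
}
in.close();
for (BufferedWriter w : out) {
    w.close();
}

Note that a child in one bucket can itself appear as a parent in another, so the buckets would still have to be stitched together afterwards, which is why the answer says "perhaps hierarchically".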

As a side, less helpful note: the real issue here is that you just don't have the resources to tackle this task, and optimization is only going to take you so far. It's like asking how to better tunnel through a mountain with a garden trowel. The real answer is probably the one you don't want to hear, which is to throw away the trowel and invest in some industrial equipment.

On a second, more helpful note (and a fun one if you're like me): you may try hooking jVisualVM up to your application and trying to understand where your heap is going, or use jhat and the -XX:+HeapDumpOnOutOfMemoryError JVM flag to see what was happening with the heap at crash time.
