
Java - Parsing a large text file

I had a quick question. I'm working on a school project and I need to parse an extremely large text file. It's for a database class, so I need to get unique actor names from the file, because actor name will be a primary key in the MySQL database. I've already written the parser and it works great, but at the time I forgot to remove the duplicates. So I decided the easiest way would be to create an actors ArrayList (using the ArrayList ADT), then use the contains() method to check whether the actor name is already in the list before I print it to a new text file. If it is, I do nothing; if it isn't, I add it to the ArrayList and print it to the file. Now the program is running extremely slowly. Before the ArrayList, it took about 5 minutes; the old actor file was 180k lines with duplicates left in. Now it's been running for 30 minutes and has only reached 12k so far. (I'm expecting 100k-150k total this time.)

I left the size of the ArrayList unspecified because I don't know how many actors are in the file, but there are at least 1-2 million lines. I was thinking of just setting its capacity to 5 million and checking afterwards whether it caught them all. (Simply check the last ArrayList index: if it's empty, the list didn't run out of space.) Would this reduce the time, since the ArrayList wouldn't be constantly doubling and recopying everything? Is there another method that would be faster than this? I'm also concerned my computer might run out of memory before it completes. Any advice would be great.

(Also, I did try running the 'unique' command on the text file without success. The actor names print out one per line, in a single column, so I was thinking maybe the command was wrong. How would you remove duplicates from a text file column at a Windows or Linux command prompt?) Thank you, and sorry for the long post. I have a midterm tomorrow and am starting to get stressed.
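For the command-line part: the standard Unix tool is called uniq, not unique, and it only removes *adjacent* duplicate lines, which is likely why it appeared not to work; the input has to be sorted first. A minimal sketch (actors.txt is a hypothetical file name standing in for the real one):

```shell
# sample column of names, one per line
printf 'Tom Hanks\nMeryl Streep\nTom Hanks\n' > actors.txt

# uniq only drops adjacent duplicates, so sort first:
sort actors.txt | uniq > actors_unique.txt
# or equivalently, in one step:
sort -u actors.txt > actors_unique.txt

cat actors_unique.txt   # Meryl Streep, then Tom Hanks
```

On Windows, the same `sort | uniq` pipeline is available through Git Bash, Cygwin, or WSL.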

Use a Set instead of a List so you don't have to check whether the collection already contains the element. A Set doesn't allow duplicates.

A lookup with ArrayList's contains() costs roughly O(n). Doing that a million times is, I think, what's killing your program.

Use a HashSet implementation of Set. It gives you theoretically constant-time lookup and will automatically remove duplicates for you.
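A minimal sketch of the HashSet approach (the class name and sample names are hypothetical stand-ins for the real parser output):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ActorDedup {
    // Collect unique names. Set.add() returns false for a duplicate,
    // so no separate contains() pass is needed, and each add is O(1)
    // on average instead of ArrayList's O(n) contains() scan.
    static Set<String> dedup(String[] parsedNames) {
        Set<String> actors = new LinkedHashSet<>(); // keeps first-seen order
        for (String name : parsedNames) {
            if (actors.add(name)) {
                // first time this actor is seen:
                // this is where the name would be written to the output file
            }
        }
        return actors;
    }

    public static void main(String[] args) {
        String[] parsed = { "Tom Hanks", "Meryl Streep", "Tom Hanks", "Gary Oldman" };
        System.out.println(dedup(parsed)); // [Tom Hanks, Meryl Streep, Gary Oldman]
    }
}
```

A plain HashSet works just as well if output order doesn't matter; LinkedHashSet simply preserves the order in which names were first encountered.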

- Try using a memory-mapped file in Java for faster access to the large file.

- Instead of an ArrayList, use a HashMap collection where the key is the actor's name (or its hash code). This will improve the speed a lot, since looking up a key in a HashMap is very fast.
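A rough sketch combining both suggestions, memory-mapping the file and collecting names into a hash-based Set (the class name, sample data, and the single-byte-encoding assumption for names are mine, not the poster's; for a real multi-gigabyte file you'd map it in chunks, since one mapping is limited to 2 GB):

```java
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.Set;

public class MappedDedup {
    // Map the file into memory and collect one unique name per line.
    static Set<String> uniqueLines(Path path) throws Exception {
        Set<String> unique = new HashSet<>();
        try (RandomAccessFile raf = new RandomAccessFile(path.toFile(), "r");
             FileChannel ch = raf.getChannel()) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            StringBuilder line = new StringBuilder();
            while (buf.hasRemaining()) {
                char c = (char) buf.get(); // assumes a single-byte encoding
                if (c == '\n') {
                    if (line.length() > 0) unique.add(line.toString().trim());
                    line.setLength(0);
                } else {
                    line.append(c);
                }
            }
            if (line.length() > 0) unique.add(line.toString().trim()); // last line
        }
        return unique;
    }

    public static void main(String[] args) throws Exception {
        // tiny temp file standing in for the real actor list
        Path tmp = Files.createTempFile("actors", ".txt");
        Files.write(tmp, "Tom Hanks\nMeryl Streep\nTom Hanks\n".getBytes());
        System.out.println(uniqueLines(tmp).size()); // prints 2
        Files.delete(tmp);
    }
}
```

In practice, a plain BufferedReader feeding a HashSet is usually fast enough here; the dominant cost in the original program was the O(n) contains() scan, not the file I/O.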

