
Sort huge file in Java

I have a huge file with one unique word per line. The file is around 1.6 GB (after this I have to sort other files of around 15 GB). Until now, for smaller files, I used Arrays.sort(). But for this file I get java.lang.OutOfMemoryError: Java heap space. I know the reason for this error. Is there any way to do this without writing a complete quicksort or merge sort program myself?

I read that Arrays.sort() uses quicksort or a hybrid sort internally. Is there any ready-made procedure like Arrays.sort() for files this large?

If I have to write a sorting program, which algorithm should I use: quicksort or merge sort? I'm worried about the worst case.

Depending on the structure of the data to store, you can do many different things.

If the data is well structured and you need to sort by one or more specific fields (in which case system tools might not be helpful), you are probably better off using a datastore that supports sorting. MongoDB comes to mind as a good fit, provided the size doesn't exceed a few hundred GB. Other NoSQL datastores might also fit the bill nicely, although Mongo's ease of use and installation and its support for JSON data make it a really great candidate.

If you really want to go with the Java approach, it gets really tricky. This is the kind of question you ask at job interviews, and I would never actually expect anybody to implement the code. However, the general solution is merge sort (using random access files is a bad idea because it means insertion sort, i.e., a non-optimal run time, which can be bad given the size of your file).

By merge sort I mean reading one chunk of the file at a time, small enough to fit in memory (so it depends on how much RAM you have), sorting it, and then writing it back to a new file on disk. After you have read the whole file, you can start merging the chunk files two at a time by reading just the head of each and writing the smaller of the two records to a third file. Do that for the 'first generation' of files and then continue with the second one until you end up with one big sorted file. Note that this is basically a bottom-up way of implementing merge sort, the academic recursive algorithm being the top-down approach.
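The merge step described above can be sketched roughly as follows. This is only a minimal illustration, assuming one record per line, UTF-8 text, and plain String ordering; the class name TwoWayMerge and its merge method are hypothetical names, not part of any library.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: merges two already-sorted line files into a third file.
final class TwoWayMerge {
    static void merge(Path left, Path right, Path out) throws IOException {
        try (BufferedReader a = Files.newBufferedReader(left, StandardCharsets.UTF_8);
             BufferedReader b = Files.newBufferedReader(right, StandardCharsets.UTF_8);
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            // Only the current head of each file is held in memory.
            String x = a.readLine();
            String y = b.readLine();
            while (x != null || y != null) {
                // Take from the left file when the right one is exhausted
                // or when the left head is not larger than the right head.
                boolean takeLeft = y == null || (x != null && x.compareTo(y) <= 0);
                if (takeLeft) {
                    w.write(x);
                    x = a.readLine();
                } else {
                    w.write(y);
                    y = b.readLine();
                }
                w.newLine();
            }
        }
    }
}
```

Calling merge on every pair of chunk files, and then on the resulting files, until only one file remains gives the successive generations mentioned above.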

Note that the intermediate files can be avoided altogether by using a multiway merge algorithm, which merges all the sorted chunk files in a single pass. This is typically based on a heap/priority queue, so the implementation might get slightly more complex, but it reduces the number of I/O operations required.
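A heap-based multiway merge might look something like the sketch below. It assumes every input file is already sorted, holds one record per line, and that natural String order is the desired order; KWayMerge and Head are hypothetical names chosen for illustration.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical helper: merges any number of sorted line files in one pass.
final class KWayMerge {
    // Pairs the current head line of a chunk with the reader it came from.
    private static final class Head implements Comparable<Head> {
        final String line;
        final BufferedReader source;
        Head(String line, BufferedReader source) { this.line = line; this.source = source; }
        @Override public int compareTo(Head other) { return line.compareTo(other.line); }
    }

    static void merge(List<Path> sortedChunks, Path out) throws IOException {
        List<BufferedReader> readers = new ArrayList<>();
        try (BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            PriorityQueue<Head> heap = new PriorityQueue<>();
            // Seed the heap with the first line of every non-empty chunk.
            for (Path chunk : sortedChunks) {
                BufferedReader r = Files.newBufferedReader(chunk, StandardCharsets.UTF_8);
                readers.add(r);
                String first = r.readLine();
                if (first != null) heap.add(new Head(first, r));
            }
            // Repeatedly emit the smallest head and refill from the same chunk.
            while (!heap.isEmpty()) {
                Head smallest = heap.poll();
                w.write(smallest.line);
                w.newLine();
                String next = smallest.source.readLine();
                if (next != null) heap.add(new Head(next, smallest.source));
            }
        } finally {
            for (BufferedReader r : readers) r.close();
        }
    }
}
```

The heap never holds more than one line per chunk, so memory use depends on the number of chunks rather than on the total file size. One caveat of this sketch is that it keeps every chunk file open at once, which may hit the operating system's open-file limit when there are very many chunks.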

Please also see these links.

Implementing the above in Java shouldn't be too difficult with some careful design, although it can definitely get tricky. I still highly recommend an out-of-the-box solution like Mongo.

As it turns out, your problem is that your heap cannot accommodate such a large array, so you must forget any solution that implies loading the whole file content into an array (as long as you can't grow your heap).
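To see how much heap the JVM is allowed to use, something like the following sketch could help. The class and method names (HeapCheck, mightFitInHeap) are hypothetical, and the check is deliberately coarse: a file that passes it may still not fit, because Java strings and arrays take more memory than the raw bytes on disk.

```java
// Hypothetical helper: a rough, necessary-but-not-sufficient heap check.
final class HeapCheck {
    // Returns the maximum heap size, in bytes, that this JVM will attempt to use.
    static long maxHeapBytes() {
        return Runtime.getRuntime().maxMemory();
    }

    // True only if the raw file size is below the maximum heap size.
    // A false result means loading the whole file is certainly not an option.
    static boolean mightFitInHeap(long fileSizeBytes) {
        return fileSizeBytes < maxHeapBytes();
    }
}
```

The maximum heap can be raised with the standard -Xmx option when starting the JVM (for example java -Xmx4g ...), but that only helps while the data is smaller than the physical memory of the machine.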

So you're facing streaming. It's the only (and typical) solution when you have to handle input sources that are larger than your available memory. I would suggest streaming the file content to your program, which should perform the sorting by outputting either to a random access file (trickier) or to a database.

I'd take a different approach.

Given a file with, say, a single element per line, I'd read the first n elements. I would repeat this m times, such that the number of lines in the file is n * m + C, with C being the left-over lines.

When dealing with Integers, you may wish to use around 100,000 elements per read; with Strings I would use fewer, maybe around 1,000. It depends on the data type and the memory needed per element.

From there, I would sort the n elements and write them to a temporary file with a unique name.
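The read, sort, and write steps so far could be sketched as below. This assumes one String element per line and UTF-8 text; ChunkSplitter, splitIntoSortedChunks, and the linesPerChunk parameter (the n above, which must be positive) are hypothetical names for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: splits a big line file into sorted temporary chunk files.
final class ChunkSplitter {
    static List<Path> splitIntoSortedChunks(Path input, int linesPerChunk) throws IOException {
        List<Path> chunks = new ArrayList<>();
        try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
            List<String> buffer = new ArrayList<>(linesPerChunk);
            String line;
            while ((line = in.readLine()) != null) {
                buffer.add(line);
                // Once n lines are buffered, sort them and flush them to disk.
                if (buffer.size() == linesPerChunk) {
                    chunks.add(writeSorted(buffer));
                    buffer.clear();
                }
            }
            // The C left-over lines form a final, shorter chunk.
            if (!buffer.isEmpty()) chunks.add(writeSorted(buffer));
        }
        return chunks;
    }

    private static Path writeSorted(List<String> lines) throws IOException {
        Collections.sort(lines);
        // createTempFile picks a unique name in the default temp directory.
        Path chunk = Files.createTempFile("sort-chunk-", ".txt");
        Files.write(chunk, lines, StandardCharsets.UTF_8);
        return chunk;
    }
}
```

Only one chunk of n lines is held in memory at any time, so n is the knob that trades RAM for the number of temporary files.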

Now, since every temporary file is sorted, the smallest element of each file is at its start. You can then iterate over the files until you have processed all the elements, each time finding the smallest element among the current heads and writing it to the new final output.

This approach reduces the amount of RAM needed, relying on drive space instead, and allows you to sort files of any size.

Build an array of record positions inside the file (a kind of index); that might fit into memory instead. You need one 8-byte Java long per file record. Sort the array, loading records only for comparison and not retaining them (use RandomAccessFile). After sorting, write the new final file, using the index pointers to fetch the records in the needed order.

This will also work if the records are not all the same size.
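A rough sketch of this index idea is shown below, under several assumptions: one record per line, single-byte text such as ASCII (RandomAccessFile.readLine maps each byte to one char), and a boxed List<Long> used for brevity, which costs more than 8 bytes per record (a primitive long[] with a hand-written sort would be needed to reach that figure). OffsetIndexSort and lineAt are hypothetical names.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: sorts a line file by sorting an in-memory index of offsets.
final class OffsetIndexSort {
    static void sort(Path input, Path out) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(input.toFile(), "r");
             BufferedWriter w = Files.newBufferedWriter(out, StandardCharsets.ISO_8859_1)) {
            // Pass 1: record the byte offset at which every line starts.
            List<Long> offsets = new ArrayList<>();
            long pos = 0;
            while (raf.readLine() != null) {
                offsets.add(pos);
                pos = raf.getFilePointer();
            }
            // Sort the offsets; records are loaded only while being compared.
            offsets.sort((a, b) -> lineAt(raf, a).compareTo(lineAt(raf, b)));
            // Pass 2: follow the sorted index and copy records in order.
            for (long offset : offsets) {
                w.write(lineAt(raf, offset));
                w.newLine();
            }
        }
    }

    // Seeks to the given offset and reads the record that starts there.
    private static String lineAt(RandomAccessFile raf, long offset) {
        try {
            raf.seek(offset);
            return raf.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Every comparison costs two seeks and two unbuffered reads, so this version trades speed for memory and is likely to be slow on a multi-gigabyte file.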


 