
Sorting a huge file in Java

I have a file that consists of a single row:

 1 , 1 2 , 1 3 6 , 4 ,...

In this representation, spaces separate the integers and commas. This string is so huge that I can't read it with RandomAccessFile.readLine() (almost 4 GB needed), so I created a buffer that can hold 10 integers. My task is to sort all the integers in the string.

Could you please help?

EDIT

@Oscar Reyes

I need to write some sequences of integers to a file and then read them back. Actually, I don't know how to do it. I'm a newbie. So I decided to write the integers as chars: the delimiter between integers is "," and the delimiter between sequences is "\n\r". So I created a monster that reads it:

public BinaryRow getFilledBuffer(String filePath, long offset) throws IOException{
    mainFile = new RandomAccessFile(filePath, "r");

    if (mainFile.length() == 0){
        return new BinaryRow();
    }

    StringBuilder str = new StringBuilder();

    mainFile.seek(mainFile.length()-4); //that is "\n" symbol
    char chN = mainFile.readChar();

    mainFile.seek(offset);
    int i = 0;
    char nextChar = mainFile.readChar();
    while (i < 11 && nextChar != chN){
        str.append(nextChar);
        if (nextChar == ','){
            i++;
            if (i == 10){
                break;
            }
        }
        nextChar = mainFile.readChar();
    }

    if (nextChar == chN){
        position = -1;
    }else{
        position = mainFile.getFilePointer();
    }

    BinaryRow br = new BinaryRow();

    StringBuilder temp = new StringBuilder();

    for (int j = 0; j < str.length(); j++){
        if ((str.charAt(j) != ',')){
            temp.append(str.charAt(j));
            if (j == str.length() - 1){
                br.add(Integer.parseInt(temp.toString()));
            }   
        }else{
            br.add(Integer.parseInt(temp.toString()));
            temp.delete(0, temp.length());
        }
    }


    mainFile.close();
    return br;

}

If you could advise how to do it, please do =)

This is exactly where QuickSort comes from: back then there was not enough RAM to sort in memory, so the procedure was to store partial results on disk.

So what you can do is:

  1. Pick a pivot.
  2. Read your file sequentially and store the values lower than the pivot in temp_file_1 and the values greater than or equal to the pivot in temp_file_2.
  3. Repeat the procedure on temp_file_1 and append the result to result_file.
  4. Repeat the procedure on temp_file_2 and append the result to result_file.

Stop splitting when a part is small enough to be sorted in memory.

This way you'll be able to sort in chunks, storing the partial results in temp files, and you'll end up with a final file containing the sorted result.

EDIT: I told you a quick sort was possible.

It seems like you would need some extra space for the temp files after all.

Here's what I did.

I created a 40 MB file with numbers separated by commas.

I named it input:

input http://img200.imageshack.us/img200/5129/capturadepantalla201003t.png

Input is 40 MB

During the sort, tmp files holding the buckets of "greater than" and "lower than" values are created, and when the sort is finished the values are sent to a file called (guess what) output.

processing http://img200.imageshack.us/img200/1672/capturadepantalla201003y.png

Temp files are created with the partial results

Finally, all the tmp files are deleted and the result is kept in the file "output" with the correctly sorted sequence of numbers:

output http://img203.imageshack.us/img203/5950/capturadepantalla201003w.png

Finally the file "output" is created; notice it is 40 MB too

Here's the full program.

import java.io.*;
import java.util.*;

public class FileQuickSort {

    // 16 megabytes in this sample; the more memory your program has, the less disk writing will be needed.
    static final int MAX_SIZE = 1024*1024*16;

    public static void main( String [] args ) throws IOException {
        File output = new File("output");
        output.delete(); // results are appended, so start without a stale output file
        fileQuickSort( new File("input"), output );
        System.out.println();
    }

    // Sorts the comma-separated integers in inputFile and appends them to outputFile.
    static void fileQuickSort( File inputFile, File outputFile ) throws IOException {
        Scanner scanner = new Scanner( new BufferedInputStream( new FileInputStream( inputFile ), MAX_SIZE));
        scanner.useDelimiter(",");

        if( inputFile.length() > MAX_SIZE && scanner.hasNextInt()) {
            System.out.print("-");

            // Too big for memory: put the values in two buckets...
            File lowerFile = File.createTempFile("quicksort-","-lower.tmp",new File("."));
            File greaterFile = File.createTempFile("quicksort-","-greater.tmp", new File("."));
            PrintStream lower   = createPrintStream(lowerFile);
            PrintStream greater = createPrintStream(greaterFile);
            int pivot = scanner.nextInt();

            // Values equal to the pivot are only counted, never written to a bucket.
            // Each bucket is therefore strictly smaller than the input, so a file
            // full of identical values cannot recurse forever.
            long equalCount = 1; // the pivot itself

            // Read the file and put the values lower than the pivot in one file
            // and the values greater than the pivot in the other
            while( scanner.hasNextInt() ){
                int current = scanner.nextInt();

                if( current < pivot ){
                    lower.printf("%d,",current);
                } else if( current > pivot ){
                    greater.printf("%d,",current);
                } else {
                    equalCount++;
                }
            }
            // close the streams before reading the buckets again
            scanner.close();
            lower.close();
            greater.close();

            // sort the lower part, then emit the pivot values, then sort the greater part
            fileQuickSort( lowerFile , outputFile );
            lowerFile.delete();

            PrintStream out = createPrintStream( outputFile );
            for( long i = 0; i < equalCount; i++ ) {
                out.printf("%d,",pivot);
            }
            out.close();

            fileQuickSort( greaterFile , outputFile );
            greaterFile.delete();

            // And you're done.
        } else {

            // Else, the file is small enough to process in RAM
            System.out.print(".");
            List<Integer> smallFileIntegers = new ArrayList<Integer>();
            // Read it
            while( scanner.hasNextInt() ){
                smallFileIntegers.add( scanner.nextInt() );
            }
            scanner.close();

            // Sort them in memory
            Collections.sort( smallFileIntegers );

            PrintStream out = createPrintStream( outputFile );
            for( int i : smallFileIntegers ) {
                out.printf("%d,",i);
            }
            out.close();
            // And you're done
        }
    }

    // Opens the file in append mode so that each sorted part is added after the previous one.
    private static PrintStream createPrintStream( File file ) throws IOException {
        boolean append = true;
        return new PrintStream( new BufferedOutputStream( new FileOutputStream( file, append )));
    }
}

The format of the files is number,number,number,number

Your current format is: number , numb , ber

To fix that you just have to read it all and skip the blanks.

Ask another question for that.
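For what it's worth, skipping the blanks can be done in a single streaming pass, without loading the row into memory. The following is only a sketch: the class and method names (BlankStripper, stripBlanks) are made up for illustration, and it assumes the blanks are ordinary whitespace characters in a text file that a Reader can decode.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;

// Hypothetical helper, not part of the program above.
class BlankStripper {

    // Copies every character from in to out except whitespace,
    // so "1 2 , 1 3 6 , 4" becomes "12,136,4".
    static void stripBlanks(Reader in, Writer out) throws IOException {
        char[] buffer = new char[8192];
        int read;
        while ((read = in.read(buffer)) != -1) {
            int kept = 0;
            // Compact the non-blank characters to the front of the buffer.
            for (int i = 0; i < read; i++) {
                if (!Character.isWhitespace(buffer[i])) {
                    buffer[kept++] = buffer[i];
                }
            }
            out.write(buffer, 0, kept);
        }
        out.flush();
    }
}
```

You would call it with a FileReader on the original file and a BufferedWriter on a new file, then feed the cleaned file to the sort. Only the 8 KB buffer is ever held in memory.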

Read it into memory in chunks (100 MB each?), one chunk at a time, sort each chunk and save it to disk.
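A rough sketch of that first step, assuming the number,number,number format with no blanks. The names ChunkSorter and writeSortedChunks are invented for the example, and the chunk size is counted in integers rather than megabytes to keep it short:

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Scanner;

// Hypothetical helper illustrating the "sort one chunk at a time" step.
class ChunkSorter {

    // Reads comma-separated integers from input, sorts them chunkSize values
    // at a time, and writes each sorted chunk to its own temp file.
    static List<Path> writeSortedChunks(Readable input, int chunkSize) throws IOException {
        List<Path> chunkFiles = new ArrayList<>();
        Scanner scanner = new Scanner(input).useDelimiter(",");
        int[] chunk = new int[chunkSize];
        while (scanner.hasNextInt()) {
            // Fill the buffer with up to chunkSize values.
            int count = 0;
            while (count < chunkSize && scanner.hasNextInt()) {
                chunk[count++] = scanner.nextInt();
            }
            Arrays.sort(chunk, 0, count);
            // Save the sorted chunk in the same comma-separated format.
            Path file = Files.createTempFile("chunk-", ".tmp");
            try (PrintWriter out = new PrintWriter(Files.newBufferedWriter(file))) {
                for (int i = 0; i < count; i++) {
                    out.print(chunk[i]);
                    out.print(',');
                }
            }
            chunkFiles.add(file);
        }
        return chunkFiles;
    }
}
```

For the real file you would pass a BufferedReader over it and pick a chunkSize that fits comfortably in your heap. The returned temp files are the ordered chunks to be merged.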

Then open all the sorted chunks, read the first element of each, and append the lowest to the output. Then read the next element from the chunk that value came from, and repeat.

When merging, you can keep an array holding the last int read from each chunk and just iterate over it to find the lowest. Then you replace the value you just used with the next element from the chunk it was taken from.

example with chunks [1, 5, 16] [2, 9, 14] [3, 8, 10]
array [(1), 2, 3], lowest 1 --> to output
      [5, (2), 3], lowest 2 --> to output
      [5, 9, (3)], lowest 3 -->
      [(5), 9, 8],        5
      [16, 9, (8)],       8
      [16, (9), 10],      9 
...
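The merge traced above could look roughly like this. It is a sketch only: ChunkMerger and merge are illustrative names, and each chunk is modelled as an Iterator<Integer> so the example stays self-contained. With chunk files you would wrap each file's Scanner in such an iterator.

```java
import java.util.Iterator;
import java.util.List;
import java.util.function.IntConsumer;

// Hypothetical helper illustrating the merge step with an array of "heads".
class ChunkMerger {

    // Merges already-sorted chunks by repeatedly emitting the lowest head.
    static void merge(List<? extends Iterator<Integer>> chunks, IntConsumer output) {
        int n = chunks.size();
        int[] heads = new int[n];           // last value read from each chunk
        boolean[] hasHead = new boolean[n]; // false once a chunk is exhausted
        for (int i = 0; i < n; i++) {
            hasHead[i] = chunks.get(i).hasNext();
            if (hasHead[i]) {
                heads[i] = chunks.get(i).next();
            }
        }
        while (true) {
            // Scan the array for the chunk holding the lowest value.
            int lowest = -1;
            for (int i = 0; i < n; i++) {
                if (hasHead[i] && (lowest == -1 || heads[i] < heads[lowest])) {
                    lowest = i;
                }
            }
            if (lowest == -1) {
                return; // every chunk is exhausted
            }
            output.accept(heads[lowest]);
            // Replace the used value with the next one from the same chunk.
            Iterator<Integer> source = chunks.get(lowest);
            hasHead[lowest] = source.hasNext();
            if (hasHead[lowest]) {
                heads[lowest] = source.next();
            }
        }
    }
}
```

The linear scan is fine for a few dozen chunks. With many more, a java.util.PriorityQueue keyed on the head value would find the lowest one faster.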
