简体   繁体   中英

merging sorted files Java

im implementing external merge sort using Java.

So given a file I split it into smaller ones , then sort the smaller portions and finally merge the sorted (smaller) files.

So , the last step is what im having trouble with.

I have a list of files and I want at each step , take the minimum value of the first rows of each file and then remove that line.

So , it is supposed to be something like this:

public static void mergeSortedFiles(List<File> sorted, File output) throws IOException {
    BufferedWriter wf = new BufferedWriter(new FileWriter(output));
    String curLine = "";
    while(!sorted.isEmpty()) {
        curLine = findMinLine(sorted);
        wf.write(curLine);
    }
}

public static String findMinLine(List<File> sorted) throws IOException {
    List<BufferedReader> brs = new ArrayList<>();
    for(int i =0; i<sorted.size() ; i++) {
        brs.add(new BufferedReader(new FileReader(sorted.get(i))));
    }
    List<String> lines = new ArrayList<>();
    for(BufferedReader br : brs) {
        lines.add(br.readLine());
    }
    Collections.sort(lines);
    return lines.get(0);
}

Im not sure how to update the files, anyone can help with that?

Thanks for helping!

You can create a Comparable wrapper around each file and then place the wrappers in a heap (for example a PriorityQueue ).

public class ComparableFile<T extends Comparable<T>> implements Comparable<ComparableFile<T>> {
    private final Deserializer<T> deserializer;
    private final Iterator<String> lines;
    private T buffered;

    public ComparableFile(File file, Deserializer<T> deserializer) {
        this.deserializer = deserializer;
        try {
            this.lines = Files.newBufferedReader(file.toPath()).lines().iterator();
        } catch (IOException e) {
            // deal with it differently if you want, I'm just providing a working example
            // and wanted to use the constructor in a lambda function
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public int compareTo(ComparableFile<T> that) {
        T mine = peek();
        T theirs = that.peek();

        if (mine == null) return theirs == null ? 0 : -1;
        if (theirs == null) return 1;
        return mine.compareTo(theirs);
    }

    public T pop() {
        T tmp = peek();

        if (tmp != null) {
            buffered = null;
            return tmp;
        }

        throw new NoSuchElementException();
    }

    public boolean isEmpty() {
        return peek() == null;
    }

    private T peek() {
        if (buffered != null) return buffered;
        if (!lines.hasNext()) return null;
        return buffered = deserializer.deserialize(lines.next());
    }
}

Then, you can merge them this way:

public class MergeFiles<T extends Comparable<T>> {
    private final PriorityQueue<ComparableFile<T>> files;

    public MergeFiles(List<File> files, Deserializer<T> deserializer) {
        this.files = new PriorityQueue<>(files.stream()
                .map(file -> new ComparableFile<>(file, deserializer))
                .filter(comparableFile -> !comparableFile.isEmpty())
                .collect(toList()));
    }

    public Iterator<T> getSortedElements() {
        return new Iterator<T>() {
            @Override
            public boolean hasNext() {
                return !files.isEmpty();
            }

            @Override
            public T next() {
                if (!hasNext()) throw new NoSuchElementException();
                ComparableFile<T> head = files.poll();
                T next = head.pop();
                if (!head.isEmpty()) files.add(head);
                return next;
            }
        };
    }
}

And here's some code to demonstrate it works:

public static void main(String[] args) throws IOException {
    List<File> files = Arrays.asList(
            newTempFile(Arrays.asList("hello", "world")),
            newTempFile(Arrays.asList("english", "java", "programming")),
            newTempFile(Arrays.asList("american", "scala", "stackoverflow"))
    );

    Iterator<String> sortedElements = new MergeFiles<>(files, line -> line).getSortedElements();
    while (sortedElements.hasNext()) {
        System.out.println(sortedElements.next());
    }
}

private static File newTempFile(List<String> words) throws IOException {
    File tempFile = File.createTempFile("sorted-", ".txt");
    Files.write(tempFile.toPath(), words);
    tempFile.deleteOnExit();
    return tempFile;
}

Output:

american
english
hello
java
programming
scala
stackoverflow
world

So what you want to do is to swap two lines in a text file? You can do it by using a RandomAccessFile however this will be horrible slow since everytime when you swap two lines you have to wait for the next IO burst. So i highly recommend you to use the following code to be able to do the merge sort on the heap:

List<String> lines1 = Files.readAllLines(youFile1);
List<String> lines2 = Files.readAllLines(youFile2);
//use merge sort on theese lines
List<String> merged;
FileWriter writer = new FileWriter(yourOutputFile); 
for(String str: merged) {
    writer.write(str + System.lineSeparator());
}
writer.close();

The standard merge technique between a fixed number of files (say, 2) is :

  • have a variable for the value of the ordering key of the current record of each file (for java, make that variable Comparable).
  • start the process by reading the first record of each file (and fill in the corresponding variable)
  • loop (until end-of-file on both) through a code block that says essentially

if (key_1.compareTo(key_2) == 0) { process both files ; then read both files} else if (key_1.compareTo(key_2) == -1) { process file 1 ; then read file 1} else { process file 2 ; then read file 2}

Note how this code does essentially nothing more than determine the file with the lowest key, and process that.

If your number of files is variable, then your number of key variables is variable too, and "determining the file with the lowest current key" cannot be done as per above. Instead, have as many current_key_value objects as there are files, and store them all in a TreeSet. Now, the first element of the TreeSet will be the lowest current key value of all the files and if you make sure that you maintain a link between your key variable and the file number you just process that file (and delete the just processed key value from the TreeSet and read a new record from the processed file and add its key value to the TreeSet).

The technical post webpages of this site follow the CC BY-SA 4.0 protocol. If you need to reprint, please indicate the site URL or the original address.Any question please contact:yoyou2525@163.com.

 
粤ICP备18138465号  © 2020-2024 STACKOOM.COM