
How to find difference (line-based) in sorted large text files in Java without loading them in full into memory?

How can I find the line-based differences between two large, sorted text files in Java without loading either file fully into memory?
I'm looking for something similar to Unix "diff" (which also seems to load whole files into memory) that can identify missing or extra lines, but in Java.

Linked question: Comparing two text large files with URLs in Java with external memory only?

You only need to advance the file whose current line is smaller (according to compareTo). If both current lines are equal, read the next line from both files; if one is bigger than the other, read only from the file with the smaller line. Whenever you are reading from just one file, you are inside a difference: every line read between two switches (from reading only file 1 to reading file 2 or both, or from reading only file 2 to reading file 1 or both) is missing from the other file.
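This walk only works if both files really are sorted in String.compareTo order. Output from Unix sort can follow locale collation rules instead; for plain ASCII data, sorting with LC_ALL=C should agree with compareTo. As a minimal sketch (the class and method names here are illustrative, not part of the answer), a streaming check for that precondition could look like this:

```java
import java.io.BufferedReader;
import java.io.IOException;

class SortedCheck {
    /**
     * Returns the 1-based number of the first line that compares smaller
     * than the line before it, or -1 if the whole input is in
     * non-decreasing String.compareTo order.
     * Only two lines are held in memory at any time.
     */
    static long firstUnsortedLine(BufferedReader in) throws IOException {
        String previous = in.readLine();
        long lineNumber = 1;
        String current;
        while ((current = in.readLine()) != null) {
            lineNumber++;
            if (previous.compareTo(current) > 0) {
                return lineNumber; // this line is out of order
            }
            previous = current;
        }
        return -1; // empty input or fully sorted
    }
}
```

Running it once over each file before comparing is cheap, since it is a single sequential pass.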

To make the switching logic clearer, here is the case where reading moves from file 1 to file 2:

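            if (line1.compareTo(line2) > 0) {
                // line2 is smaller, so the next read comes only from file 2.
                if (lastRead == 1) {
                    // The pending run was read only from file 1: report it.
                    System.out.println(previousLines + " found in " + path1 + " but not in " + path2);
                    previousLines.clear();
                }
                previousLines.add(line2);
                line2 = in2.readLine();
                lastRead = 2; // we have just read only from file 2
            }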

If line1 is bigger than line2 (line1 being the current line from file 1, line2 the current line from file 2), the next read comes only from the second file. If the previous read came only from file 1 (not from both at the same time, and not from file 2), all lines in previousLines are reported. previousLines collects lines for as long as they differ. lastRead keeps track of which file was read last (0: both at the same time, 1: only the first, 2: only the second).

Late edit: here is the whole method body. As I mentioned in the comment, it doesn't check what happens when one file is exhausted before the other. As it stands, it works fine if the last line is the same in both files. You can add further checks for readLine returning null for one file or the other.

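// Requires: java.io.BufferedReader, java.io.IOException, java.nio.file.Files,
//           java.nio.file.Path, java.util.ArrayList, java.util.List
void printDifferences(Path path1, Path path2) {
    try (BufferedReader in1 = Files.newBufferedReader(path1);
         BufferedReader in2 = Files.newBufferedReader(path2)) {
        String line1 = in1.readLine(), line2 = in2.readLine();
        // Which file the previous step read from: 0 = both, 1 = only file 1, 2 = only file 2.
        int lastRead = 0;
        // Current run of lines that are present in only one of the files.
        List<String> previousLines = new ArrayList<>();
        // Note: the loop stops as soon as either file is exhausted, so a pending run
        // and any remaining lines of the other file are not reported.
        while (line1 != null && line2 != null) {
            int cmp = line1.compareTo(line2);
            if (cmp > 0) {
                // line2 is smaller: it exists only in file 2.
                if (lastRead == 1) {
                    System.out.println(previousLines + " found in " + path1 + " but not in " + path2);
                    previousLines.clear();
                }
                previousLines.add(line2);
                line2 = in2.readLine();
                lastRead = 2;
            } else if (cmp < 0) {
                // line1 is smaller: it exists only in file 1.
                if (lastRead == 2) {
                    System.out.println(previousLines + " found in " + path2 + " but not in " + path1);
                    previousLines.clear();
                }
                previousLines.add(line1);
                line1 = in1.readLine();
                lastRead = 1;
            } else {
                // Lines match: report the pending run, if any, and advance both files.
                if (lastRead == 2) {
                    System.out.println(previousLines + " found in " + path2 + " but not in " + path1);
                }
                if (lastRead == 1) {
                    System.out.println(previousLines + " found in " + path1 + " but not in " + path2);
                }
                previousLines.clear();
                line1 = in1.readLine();
                line2 = in2.readLine();
                lastRead = 0;
            }
        }
    } catch (IOException e) {
        e.printStackTrace();
    }
}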
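The method above stops as soon as either file runs out, so trailing lines of the longer file are never reported. One possible way to cover that case, sketched here under my own assumptions rather than taken from the answer, is to keep looping while either reader still has a line and to treat an exhausted file as always "bigger". The class name SortedLineDiff, the method diff, and the "< " / "> " prefixes are all hypothetical; this version also reports each line individually instead of grouping runs.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.function.Consumer;

class SortedLineDiff {
    /**
     * Streams two sorted inputs and reports lines present in only one of them.
     * "< line" means only in the first input, "> line" only in the second.
     * Lines go straight to the consumer, so nothing accumulates in memory.
     */
    static void diff(BufferedReader in1, BufferedReader in2,
                     Consumer<String> report) throws IOException {
        String line1 = in1.readLine();
        String line2 = in2.readLine();
        while (line1 != null || line2 != null) {
            int cmp;
            if (line1 == null) {
                cmp = 1;   // first input exhausted: the rest of the second is extra
            } else if (line2 == null) {
                cmp = -1;  // second input exhausted: the rest of the first is extra
            } else {
                cmp = line1.compareTo(line2);
            }

            if (cmp < 0) {
                report.accept("< " + line1);
                line1 = in1.readLine();
            } else if (cmp > 0) {
                report.accept("> " + line2);
                line2 = in2.readLine();
            } else {
                // Same line in both inputs: skip it and advance both.
                line1 = in1.readLine();
                line2 = in2.readLine();
            }
        }
    }
}
```

Passing System.out::println as the consumer prints the differences as they are found; passing a list's add method is convenient for small test inputs.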

I thought this might be an interesting problem, so I put something together to illustrate how a difference application might work.

I had a file of words for a different application. So, I grabbed the first 100 words and reduced the size of each down to something I could test with easily.

Word List 1

aback
abandon
abandoned
abashed
abatement
abbey
abbot
abbreviate
abdomen
abducted
aberrant
aberration
abetted
abeyance

Word List 2

aardvark
aback
abacus
abandon
abatement
abbey
abbot
abbreviate
abdicate
abdomen
aberrant
aberration

My example application produces two different outputs. Here's the first output from my test run, the full difference output.

Differences between /word1.txt and /word2.txt
-----------------------------------------------------

------   Inserted   ----- | aardvark                 
aback                     | aback                    
------   Inserted   ----- | abacus                   
abandon                   | abandon                  
abandoned                 | ------   Deleted   ------
abashed                   | ------   Deleted   ------
abatement                 | abatement                
abbey                     | abbey                    
abbot                     | abbot                    
abbreviate                | abbreviate               
------   Inserted   ----- | abdicate                 
abdomen                   | abdomen                  
abducted                  | ------   Deleted   ------
aberrant                  | aberrant                 
aberration                | aberration               
abetted                   | ------   Deleted   ------
abeyance                  | ------   Deleted   ------

Now, for two really long files, where most of the text will match, this output would be hard to read. So, I also created an abbreviated output.

Differences between /word1.txt and /word2.txt
-----------------------------------------------------

------   Inserted   ----- | aardvark                 
---------------   1 line is the same   --------------
------   Inserted   ----- | abacus                   
---------------   1 line is the same   --------------
abandoned                 | ------   Deleted   ------
abashed                   | ------   Deleted   ------
--------------   4 lines are the same   -------------
------   Inserted   ----- | abdicate                 
---------------   1 line is the same   --------------
abducted                  | ------   Deleted   ------
--------------   2 lines are the same   -------------
abetted                   | ------   Deleted   ------
abeyance                  | ------   Deleted   ------

With these small test files, there's not much difference between the two reports.

With two large text files, the abbreviated report would be a lot easier to read.

Here's the example code.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

public class Difference {

    public static void main(String[] args) {
        String file1 = "/word1.txt";
        String file2 = "/word2.txt";

        try {
            new Difference().compareFiles(file1, file2);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    private void compareFiles(String file1, String file2)
            throws IOException {
        int columnWidth = 25;
        int pageWidth = columnWidth + columnWidth + 3;
        boolean isFullReport = true;

        System.out.println(getTitle(file1, file2));
        System.out.println(getDashedLine(pageWidth));
        System.out.println();

        URL url1 = getClass().getResource(file1);
        URL url2 = getClass().getResource(file2);
        if (url1 == null || url2 == null) {
            // getResource returns null when the resource is not on the classpath.
            throw new IOException("Resource not found: "
                    + (url1 == null ? file1 : file2));
        }

        // try-with-resources closes both readers even if reading fails.
        try (BufferedReader br1 = new BufferedReader(new InputStreamReader(
                     url1.openStream()));
             BufferedReader br2 = new BufferedReader(new InputStreamReader(
                     url2.openStream()))) {

            int countEqual = 0;
            String line1 = br1.readLine();
            String line2 = br2.readLine();

            // Merge-style walk: advance whichever file has the smaller line.
            while (line1 != null && line2 != null) {
                int result = line1.compareTo(line2);
                if (result == 0) {
                    countEqual++;
                    if (isFullReport) {
                        System.out.println(getFullEqualsLine(columnWidth,
                                line1, line2));
                    }
                    line1 = br1.readLine();
                    line2 = br2.readLine();
                } else if (result < 0) {
                    // line1 exists only in file 1: report it as deleted.
                    printEqualsLine(pageWidth, countEqual, isFullReport);
                    countEqual = 0;
                    System.out.println(getDifferenceLine(columnWidth,
                            line1, ""));
                    line1 = br1.readLine();
                } else {
                    // line2 exists only in file 2: report it as inserted.
                    printEqualsLine(pageWidth, countEqual, isFullReport);
                    countEqual = 0;
                    System.out.println(getDifferenceLine(columnWidth,
                            "", line2));
                    line2 = br2.readLine();
                }
            }

            printEqualsLine(pageWidth, countEqual, isFullReport);

            // Whatever remains in file 1 is missing from file 2.
            while (line1 != null) {
                System.out.println(getDifferenceLine(columnWidth,
                        line1, ""));
                line1 = br1.readLine();
            }

            // Whatever remains in file 2 is missing from file 1.
            while (line2 != null) {
                System.out.println(getDifferenceLine(columnWidth,
                        "", line2));
                line2 = br2.readLine();
            }
        }
    }

    private void printEqualsLine(int pageWidth, int countEqual,
            boolean isFullReport) {
        if (!isFullReport && countEqual > 0) {
            System.out.println(getEqualsLine(countEqual, pageWidth));
        }
    }

    private String getTitle(String file1, String file2) {
        return "Differences between " + file1 + " and " + file2;
    }

    private String getEqualsLine(int count, int length) {
        String lines = "lines are";
        if (count == 1) {
            lines = "line is";
        }
        String output = "   " + count + " " + lines +
                " the same   ";
        return getTextLine(length, output);
    }

    private String getFullEqualsLine(int columnWidth, String line1,
            String line2) {
        String format = "%-" + columnWidth + "s";
        return String.format(format, line1) + " | " +
            String.format(format, line2);
    }

    private String getDifferenceLine(int columnWidth, String line1,
            String line2) {
        String format = "%-" + columnWidth + "s";
        String deleted = getTextLine(columnWidth, "   Deleted   ");
        String inserted = getTextLine(columnWidth, "   Inserted   ");

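        // An empty line1 means "no line from file 1", so a genuinely blank
        // line that exists only in file 1 is also reported as an insertion.
        if (line1.isEmpty()) {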
            return inserted + " | " + String.format(format, line2);
        } else {
            return String.format(format, line1) + " | " + deleted;
        }
    }

    private String getTextLine(int length, String output) {
        int half2 = (length - output.length()) / 2;
        int half1 = length - output.length() - half2;
        output = getDashedLine(half1) + output;
        output += getDashedLine(half2);
        return output;
    }

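    private String getDashedLine(int count) {
        // StringBuilder avoids creating a new String on every iteration.
        StringBuilder output = new StringBuilder();
        for (int i = 0; i < count; i++) {
            output.append('-');
        }
        return output.toString();
    }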

}
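The example above loads its inputs as classpath resources through getClass().getResource. For large files that live on disk, a variation that opens them by filesystem path may be more convenient. The sketch below assumes the files are UTF-8 encoded and already sorted in compareTo order; PathDifferenceCounter and countDifferences are illustrative names, and the method only counts differing lines instead of printing the report.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class PathDifferenceCounter {
    /**
     * Counts lines that appear in only one of two sorted files.
     * Both files are streamed, so only one line per file is held in memory.
     */
    static long countDifferences(Path file1, Path file2) throws IOException {
        try (BufferedReader br1 = Files.newBufferedReader(file1, StandardCharsets.UTF_8);
             BufferedReader br2 = Files.newBufferedReader(file2, StandardCharsets.UTF_8)) {
            long differences = 0;
            String line1 = br1.readLine();
            String line2 = br2.readLine();

            while (line1 != null && line2 != null) {
                int result = line1.compareTo(line2);
                if (result == 0) {
                    line1 = br1.readLine();
                    line2 = br2.readLine();
                } else if (result < 0) {
                    differences++;          // line1 only in file 1
                    line1 = br1.readLine();
                } else {
                    differences++;          // line2 only in file 2
                    line2 = br2.readLine();
                }
            }
            // Remaining lines of either file have no counterpart.
            while (line1 != null) {
                differences++;
                line1 = br1.readLine();
            }
            while (line2 != null) {
                differences++;
                line2 = br2.readLine();
            }
            return differences;
        }
    }
}
```

A count like this can serve as a quick pre-check before producing the full or abbreviated report: a result of 0 means the two files contain the same lines.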


 