
Fastest way to parse txt file in Java

I have to parse a txt file for a tax calculator that has this form:

Name: Mary Jane
Age: 23
Status: Married
Receipts:

Id: 1
Place: Restaurant
Money Spent: 20

Id: 2
Place: Mall
Money Spent: 30

What I have done so far is:

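public void read(File file) throws FileNotFoundException {
    // try-with-resources closes the Scanner (and the underlying file) when done
    try (Scanner scanner = new Scanner(file)) {
        // hasNextLine() pairs with nextLine(); hasNext() checks for tokens instead
        while (scanner.hasNextLine()) {
            String line = scanner.nextLine();
            String[] tokens = line.split(":");
            // the value is whatever follows the last colon on the line
            String lastToken = tokens[tokens.length - 1];
            System.out.println(lastToken);
        }
    }
}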

I want to read only the second column of this file (Mary Jane, 23, Married) into a class Taxpayer(name, age, status), and the receipt information into an ArrayList.

I thought of taking the last token of each line and saving it in a String array, but I can't work out how to store the strings in the array. Can someone help me? Thank you.

The fastest way, if your data is ASCII and you don't need charset conversion, is to use a BufferedInputStream and do all the parsing yourself -- find the line terminators, parse the numbers. Do NOT use a Reader, or create Strings, or create any objects per line, or use parseInt. Just use byte arrays and look at the bytes. It's a little messier, but pretend you're writing C code, and it will be faster.

Also give some thought to how compact the data structure you're creating is, and whether you can avoid creating an object per line there too by being clever.
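To make the byte-level idea concrete, here is a minimal sketch, assuming ASCII input and assuming the only thing needed is the total of the "Money Spent" values. The class and method names are hypothetical. It scans raw bytes from a BufferedInputStream, matches the key at the start of each line, and accumulates the digits by hand, so no String is created per line and parseInt is never called.

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: sums every "Money Spent: <integer>" line in an ASCII stream.
class ByteLevelReceiptSum {
    // ASCII bytes of the key we expect at the very start of a line.
    private static final byte[] KEY = "Money Spent:".getBytes(StandardCharsets.US_ASCII);

    static long sumMoneySpent(InputStream in) throws IOException {
        InputStream buffered = new BufferedInputStream(in);
        long total = 0;
        int matched = 0;          // bytes of KEY matched so far; -1 means this line cannot match
        boolean inValue = false;  // true once the whole key has been seen on this line
        long value = 0;           // digits accumulated for the current line
        int b;
        while ((b = buffered.read()) != -1) {
            if (b == '\n') {
                // end of line: commit the value and reset the per-line state
                if (inValue) {
                    total += value;
                }
                matched = 0;
                inValue = false;
                value = 0;
            } else if (inValue) {
                // accumulate decimal digits; spaces and '\r' are ignored
                if (b >= '0' && b <= '9') {
                    value = value * 10 + (b - '0');
                }
            } else if (matched >= 0) {
                if (b == KEY[matched]) {
                    matched++;
                    if (matched == KEY.length) {
                        inValue = true;
                    }
                } else {
                    matched = -1; // some other line; skip to the next '\n'
                }
            }
        }
        if (inValue) {
            total += value; // last line had no trailing newline
        }
        return total;
    }
}
```

Calling ByteLevelReceiptSum.sumMoneySpent(new FileInputStream(file)) would then return the total for one file. The sketch only handles non-negative whole numbers; decimals or negative amounts would need extra states.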

Frankly, I think the "fastest" is a red herring. Unless you have millions of these files, it is unlikely that the speed of your code will be relevant.

And in fact, your basic approach to parsing (read a line using Scanner, split the line using String.split(...)) seems pretty sound.

What you are missing is that the structure of your code needs to match the structure of the file. Here's a sketch of how I would do it.

  • If you are going to ignore the first field of each line, you need a method that:

    1. reads a line, skipping empty lines
    2. splits it, and
    3. returns the second field.
  • If you are going to check that the first field contains the expected keyword, then modify the method to take a parameter, and check the field. (I'd recommend this version ...)

  • Then call the above method in the correct pattern; e.g.

    • call it 3 times to extract the name, age and marital status
    • call it 1 time to skip the "Receipts" line
    • use a while loop to call the method 3 times to read the 3 fields for each receipt.
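The steps above can be sketched roughly as follows. This is one possible reading of the outline, not a definitive implementation: the class names (TaxFileSketch, Taxpayer, Receipt), the readField helper, and the choice of int and double for the numeric fields are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical names throughout; only the file layout comes from the question.
class TaxFileSketch {
    static class Receipt {
        final int id;
        final String place;
        final double moneySpent;

        Receipt(int id, String place, double moneySpent) {
            this.id = id;
            this.place = place;
            this.moneySpent = moneySpent;
        }
    }

    static class Taxpayer {
        final String name;
        final int age;
        final String status;
        final List<Receipt> receipts = new ArrayList<>();

        Taxpayer(String name, int age, String status) {
            this.name = name;
            this.age = age;
            this.status = status;
        }
    }

    // Reads the next non-empty line, checks the keyword before the first colon,
    // and returns the trimmed text after it.
    static String readField(Scanner scanner, String expectedKey) {
        String line = "";
        while (line.isEmpty()) {
            if (!scanner.hasNextLine()) {
                throw new IllegalStateException(
                        "Expected '" + expectedKey + "' but reached end of input");
            }
            line = scanner.nextLine().trim();
        }
        String[] parts = line.split(":", 2); // limit 2 keeps a trailing empty value
        if (parts.length < 2 || !parts[0].trim().equals(expectedKey)) {
            throw new IllegalStateException(
                    "Expected '" + expectedKey + "' but found: " + line);
        }
        return parts[1].trim();
    }

    // Calls readField in the same order as the fields appear in the file.
    static Taxpayer read(Scanner scanner) {
        String name = readField(scanner, "Name");
        int age = Integer.parseInt(readField(scanner, "Age"));
        String status = readField(scanner, "Status");
        readField(scanner, "Receipts"); // header line, value is empty

        Taxpayer taxpayer = new Taxpayer(name, age, status);
        // hasNext() is true only while some non-whitespace text remains
        while (scanner.hasNext()) {
            int id = Integer.parseInt(readField(scanner, "Id"));
            String place = readField(scanner, "Place");
            double money = Double.parseDouble(readField(scanner, "Money Spent"));
            taxpayer.receipts.add(new Receipt(id, place, money));
        }
        return taxpayer;
    }
}
```

With a file on disk this would be called as TaxFileSketch.read(new Scanner(file)). Checking the keyword makes a malformed file fail early with a readable message instead of silently putting a value in the wrong field.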

Do you really need it to be as fast as possible? In situations like this, it's often fine to create a few objects and do a bit of garbage collection along the way in order to have more maintainable code.

I'd use two regular expressions myself (one for the taxpayer and another for the receipts loop).

My code would look something like:

public class ParsedFile {
    private Taxpayer taxpayer;
    private List<Receipt> receipts;

    // getters and setters etc.
}

public class FileParser {
    private static final Pattern TAXPAYER_PATTERN =
        // this pattern includes capturing groups in brackets ()
        Pattern.compile("Name: (.*?)\\s*Age: (.*?)\\s*Status: (.*?)\\s*Receipts:", Pattern.DOTALL);

    public ParsedFile parse(File file) throws IOException {
        // try-with-resources closes the reader even if parsing fails
        try (BufferedReader reader = new BufferedReader(new FileReader(file))) {
            String firstChunk = getNextChunk(reader);
            Taxpayer taxpayer = parseTaxpayer(firstChunk);
            List<Receipt> receipts = new ArrayList<Receipt>();
            String chunk;
            while ((chunk = getNextChunk(reader)) != null) {
                receipts.add(parseReceipt(chunk));
            }
            return new ParsedFile(taxpayer, receipts);
        }
    }

    private Taxpayer parseTaxpayer(String chunk) {
       Matcher matcher = TAXPAYER_PATTERN.matcher(chunk);
       if (!matcher.matches()) {
           // unchecked exception, so the method needs no throws clause
           throw new IllegalArgumentException(chunk + " does not match " + TAXPAYER_PATTERN.pattern());
       }
       // this is where we use the capturing groups from the regular expression
       return new Taxpayer(matcher.group(1), matcher.group(2), ...);
    }

    private Receipt parseReceipt(String chunk) {
       // TODO implement
    }

    private String getNextChunk(BufferedReader reader) {
       // keep reading lines until either a blank line or end of file
       // return the chunk as a string
    }
}
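The two stubs left open above could be filled in along these lines. This is a sketch under assumptions: the receipt pattern is a guess modelled on the taxpayer pattern, the class name ChunkSketch is hypothetical, and parseReceipt returns the three captured strings rather than a Receipt object because the Receipt class is not shown in the answer.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the two unimplemented methods of FileParser.
class ChunkSketch {
    // Assumed receipt layout: Id, Place, Money Spent, one per line.
    static final Pattern RECEIPT_PATTERN = Pattern.compile(
            "Id: (\\d+)\\s*Place: (.*?)\\s*Money Spent: (\\d+(?:\\.\\d+)?)\\s*");

    // Reads lines until a blank line or end of input.
    // Returns the lines joined with '\n' (no trailing newline), or null when nothing is left.
    static String getNextChunk(BufferedReader reader) throws IOException {
        StringBuilder chunk = new StringBuilder();
        String line;
        while ((line = reader.readLine()) != null) {
            if (line.trim().isEmpty()) {
                if (chunk.length() > 0) {
                    break;      // a blank line ends a non-empty chunk
                }
                continue;       // skip blank lines before the chunk starts
            }
            if (chunk.length() > 0) {
                chunk.append('\n');
            }
            chunk.append(line);
        }
        return chunk.length() == 0 ? null : chunk.toString();
    }

    // Returns {id, place, moneySpent} as strings; conversion is left to the caller.
    static String[] parseReceipt(String chunk) {
        Matcher matcher = RECEIPT_PATTERN.matcher(chunk);
        if (!matcher.matches()) {
            throw new IllegalArgumentException(
                    chunk + " does not match " + RECEIPT_PATTERN.pattern());
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }
}
```

Because getNextChunk does not append a trailing newline, the first chunk ends exactly at "Receipts:", which is what a full match against TAXPAYER_PATTERN requires.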

First, why do you need to invest time in the fastest possible solution? Is it because the input file is huge? I also do not understand how you want to store the result of the parsing. Consider a new class with all the fields you need to extract from the file for each person.

A few tips:

  • Avoid unnecessary per-line memory allocations; line.split(":") in your code is an example of this.
  • Use buffered input.
  • Minimize input/output operations.
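As a rough illustration of these tips, the sketch below reads through a BufferedReader and uses indexOf and substring instead of split, so each line allocates one value string rather than an array of tokens. The class and method names are hypothetical, and it assumes the value is everything after the first colon.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: collects the "second column" of every key/value line.
class ValueColumnReader {
    static List<String> readValues(BufferedReader reader) throws IOException {
        List<String> values = new ArrayList<>();
        String line;
        while ((line = reader.readLine()) != null) {
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue; // blank line or a line without a key
            }
            String value = line.substring(colon + 1).trim();
            if (!value.isEmpty()) {
                values.add(value); // header lines such as "Receipts:" have no value
            }
        }
        return values;
    }
}
```

For a file on disk this would be called as ValueColumnReader.readValues(new BufferedReader(new FileReader(file))). A List also sidesteps the fixed-size String array problem mentioned in the question, since it grows as values are added.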

If these are not enough for you, try reading this article: http://nadeausoftware.com/articles/2008/02/java_tip_how_read_files_quickly
