What's the fastest way in Java to count lines starting with a String in a huge file

Question

I have huge files (4.5 GB each) and need to count the number of lines in each file that start with a given token. There can be up to 200k occurrences of the token per file.

What would be the fastest way to traverse such a huge file and detect the token? Is there a more efficient approach than the following implementation, which uses a Scanner and String.startsWith()?

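import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Counts the lines of inputFile that start with token.
public static int countOccurences(File inputFile, String token) throws FileNotFoundException {
    int counter = 0;
    // Scanner decodes the file into a String for every line it returns.
    try (Scanner scanner = new Scanner(inputFile)) {
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().startsWith(token)) {
                counter++;
            }
        }
    }
    return counter;
}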

Notes:

So far it looks like the Scanner is the bottleneck (i.e. if I add more complex processing than token detection and apply it to all lines, the overall execution time is more or less the same).
I'm using SSDs, so there is no room for improvement on the hardware side.

Thanks in advance for your help.

Answer 1

A few pointers (assuming that the lines are relatively short and the data is really ASCII or similar):

Read a huge buffer of bytes at a time (say 1/4 GB), then chop off the incomplete line at the end and prepend it to the next read.
Search for bytes; do not waste time converting to chars.
Indicate "beginning of line" by starting your search pattern with '\n', and treat the first line specially.
Use a high-speed search that reduces search time at the expense of pre-processing (google for "fast substring search").
If actual line numbers (rather than the lines) are needed, count the lines in a separate stage.
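
The first three pointers can be sketched roughly as follows. This is only a minimal illustration, not a tuned implementation: the class name LineStartCounter, the method count and its bufferSize parameter are made up for the example. It also deviates from the pointers in two ways: instead of prepending the incomplete line to the next read, it carries a small match state across buffers, and it uses a plain byte-by-byte comparison rather than a pre-processed fast substring search. It assumes '\n' line terminators (which also covers "\r\n") and a non-empty token without line breaks.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class LineStartCounter {

    // Counts lines whose first bytes equal token, reading the file in raw byte chunks.
    static long count(Path file, byte[] token, int bufferSize) throws IOException {
        if (token.length == 0 || bufferSize <= 0) {
            throw new IllegalArgumentException("token and bufferSize must be non-empty/positive");
        }
        long count = 0;
        // Number of token bytes matched so far on the current line;
        // -1 means the line is already decided (mismatch or already counted).
        // Starting at 0 treats the first line of the file like any other line.
        int matched = 0;
        ByteBuffer buffer = ByteBuffer.allocate(bufferSize);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.READ)) {
            while (channel.read(buffer) != -1) {
                buffer.flip();
                byte[] data = buffer.array();
                int limit = buffer.limit();
                for (int i = 0; i < limit; i++) {
                    byte b = data[i];
                    if (b == '\n') {
                        matched = 0;          // a new line starts after this byte
                    } else if (matched >= 0) {
                        if (b == token[matched]) {
                            if (++matched == token.length) {
                                count++;      // whole token seen at line start
                                matched = -1;
                            }
                        } else {
                            matched = -1;     // mismatch: skip to the next '\n'
                        }
                    }
                }
                buffer.clear();
            }
        }
        return count;
    }
}
```

A call might look like LineStartCounter.count(path, "TOKEN".getBytes(StandardCharsets.US_ASCII), 64 * 1024 * 1024); the best buffer size has to be measured on the actual hardware.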

Answer 2

We can reduce the problem to searching for \n<token> in a byte stream. In that case, one quick way is to read a chunk of data sequentially from disk (the size is determined empirically, but a good starting point is 1024 pages) and hand that data to a different thread for processing.
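
A rough sketch of that reader/processor split, under some assumptions: the names PipelinedLineStartCounter, count and scan are invented for the example, a single processing thread is used so that match state can simply be carried from one chunk to the next, and a bounded queue of depth 4 is an arbitrary choice. With 4 KiB pages, 1024 pages would correspond to a chunkSize of 4 * 1024 * 1024, but that page size is itself an assumption.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PipelinedLineStartCounter {

    // Sentinel chunk telling the processing thread that the file is exhausted.
    private static final byte[] END = new byte[0];

    // The calling thread reads chunks sequentially; a second thread scans them.
    static long count(Path file, byte[] token, int chunkSize) throws Exception {
        if (token.length == 0 || chunkSize <= 0) {
            throw new IllegalArgumentException("token and chunkSize must be non-empty/positive");
        }
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4);
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Callable<Long> task = () -> scan(queue, token);
            Future<Long> result = worker.submit(task);
            try (InputStream in = Files.newInputStream(file)) {
                byte[] chunk = new byte[chunkSize];
                int n;
                while ((n = in.read(chunk)) != -1) {
                    // Hand over a copy so the read buffer can be reused immediately.
                    queue.put(Arrays.copyOf(chunk, n));
                }
            } finally {
                queue.put(END);
            }
            return result.get();
        } finally {
            worker.shutdown();
        }
    }

    // Scans chunks in arrival order, keeping the match state across chunk boundaries.
    private static long scan(BlockingQueue<byte[]> queue, byte[] token) throws InterruptedException {
        long count = 0;
        int matched = 0; // token bytes matched on the current line, -1 = line decided
        for (byte[] chunk = queue.take(); chunk != END; chunk = queue.take()) {
            for (byte b : chunk) {
                if (b == '\n') {
                    matched = 0;
                } else if (matched >= 0) {
                    if (b == token[matched]) {
                        if (++matched == token.length) {
                            count++;
                            matched = -1;
                        }
                    } else {
                        matched = -1;
                    }
                }
            }
        }
        return count;
    }
}
```

Whether the extra thread pays off depends on how the read time compares with the scan time, so it is worth measuring against a single-threaded version.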

solution1
1 ACCEPTED 2017-03-22 19:30:48

solution2
1 2017-03-22 19:36:43