
What's the fastest way in Java to count lines starting with a String in a huge file

I have huge files (4.5 GB each) and need to count the number of lines in each file that start with a given token. There can be up to 200k occurrences of the token per file.

What would be the fastest way to traverse files this large and detect the token? Is there a more efficient approach than the following implementation, which uses a Scanner and String.startsWith()?

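import java.io.File;
import java.io.FileNotFoundException;
import java.util.Scanner;

// Counts the lines of inputFile that start with token.
public static int countOccurences(File inputFile, String token) throws FileNotFoundException {
    int counter = 0;
    // try-with-resources closes the Scanner (and the underlying file) on exit
    try (Scanner scanner = new Scanner(inputFile)) {
        while (scanner.hasNextLine()) {
            // nextLine() decodes and allocates a String for every line of the file
            if (scanner.nextLine().startsWith(token)) {
                counter++;
            }
        }
    }
    return counter;
}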

Note:

  • So far it looks like the Scanner is the bottleneck (i.e. if I add more complex processing than token detection and apply it to all lines, the overall execution time stays more or less the same).
  • I'm using SSDs so there is no room for improvement on the hardware side

Thanks in advance for your help.

A few pointers (assuming that the lines are relatively short and that the data really is ASCII or similar):

  • read a huge buffer of bytes at a time (say 1/4 GB), then chop off the incomplete line at the end and prepend it to the next read

  • search for bytes, do not waste time converting to chars

  • indicate "beginning of line" by starting your search pattern with '\n'; treat the first line specially, since no newline precedes it

  • use high-speed search that reduces search time at the expense of pre-processing (google for "fast substring search")

  • if actual line numbers (rather than the lines) are needed, count the lines in a separate stage
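
The first three pointers might be sketched as follows. This is only a minimal illustration, not a tuned implementation: the class name ByteLineCounter, the method names and the bufferSize parameter are made up for the example, and it assumes ASCII-compatible data, '\n' line endings, a token that contains no newline, and lines shorter than the buffer. The comparison is a plain byte-by-byte check, not one of the fast substring search algorithms mentioned above.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ByteLineCounter {

    /** Counts lines whose first bytes equal the token, without decoding bytes to chars. */
    public static long countLinesStartingWith(Path file, String token, int bufferSize) throws IOException {
        byte[] pattern = token.getBytes(StandardCharsets.US_ASCII);
        byte[] buf = new byte[bufferSize];
        long count = 0;
        int carried = 0; // bytes of an incomplete line kept from the previous read
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf, carried, buf.length - carried)) != -1) {
                int end = carried + n;
                // Find the last newline: everything before it consists of complete lines.
                int lastNewline = end - 1;
                while (lastNewline >= 0 && buf[lastNewline] != '\n') {
                    lastNewline--;
                }
                if (lastNewline < 0) {
                    if (end == buf.length) {
                        throw new IOException("line longer than buffer (violates the assumption)");
                    }
                    carried = end; // no complete line yet, keep reading
                    continue;
                }
                count += countInCompleteLines(buf, lastNewline + 1, pattern);
                // Chop off the incomplete tail and move it to the front for the next read.
                carried = end - (lastNewline + 1);
                System.arraycopy(buf, lastNewline + 1, buf, 0, carried);
            }
        }
        // A last line without a trailing newline is still a line.
        if (carried > 0 && startsWith(buf, 0, carried, pattern)) {
            count++;
        }
        return count;
    }

    // buf[0..limit) must end with '\n', which guarantees the inner scan terminates.
    private static long countInCompleteLines(byte[] buf, int limit, byte[] pattern) {
        long count = 0;
        int lineStart = 0;
        while (lineStart < limit) {
            if (startsWith(buf, lineStart, limit, pattern)) {
                count++;
            }
            while (buf[lineStart] != '\n') {
                lineStart++;
            }
            lineStart++; // step over the newline to the start of the next line
        }
        return count;
    }

    private static boolean startsWith(byte[] buf, int from, int limit, byte[] pattern) {
        if (limit - from < pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (buf[from + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A call could look like ByteLineCounter.countLinesStartingWith(path, "token", 1 << 26); the buffer size is a placeholder to be tuned by measurement.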

We can reduce the problem to searching for \n<token> in a byte stream. In that case, one quick way is to read a chunk of data sequentially from disk (the size is best determined empirically, but a good starting point is 1024 pages) and hand that data to a different thread for processing.
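
A rough sketch of that reader/worker split, under some assumptions: the class name PipelinedTokenCounter, the queue depth of 4 and the chunkSize parameter are invented for the example, the token is non-empty ASCII and contains no newline, and lines end in '\n'. Because '\n' can only occur at the start of the pattern, a small state counter is enough to continue a match across chunk boundaries, and starting in state 1 treats the first line as if a newline preceded it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedTokenCounter {
    private static final byte[] END = new byte[0]; // sentinel marking end of input

    public static long count(Path file, String token, int chunkSize)
            throws IOException, InterruptedException {
        if (token.isEmpty() || token.indexOf('\n') >= 0) {
            throw new IllegalArgumentException("token must be non-empty and contain no newline");
        }
        byte[] pattern = ("\n" + token).getBytes(StandardCharsets.US_ASCII);
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(4); // bounded: reader cannot run far ahead
        IOException[] failure = new IOException[1];

        // Reader thread: sequential reads only, no searching.
        Thread reader = new Thread(() -> {
            try (InputStream in = Files.newInputStream(file)) {
                byte[] buf = new byte[chunkSize];
                int n;
                while ((n = in.read(buf)) != -1) {
                    if (n > 0) {
                        queue.put(Arrays.copyOf(buf, n));
                    }
                }
            } catch (IOException e) {
                failure[0] = e;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                try {
                    queue.put(END);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        reader.setDaemon(true);
        reader.start();

        // Worker (here: the calling thread) searches the byte stream for "\n" + token.
        long count = 0;
        int state = 1; // pattern bytes matched so far; 1 = "just saw a newline" (start of file)
        byte[] chunk;
        while ((chunk = queue.take()) != END) {
            for (byte b : chunk) {
                if (b == pattern[state]) {
                    if (++state == pattern.length) {
                        count++;
                        state = 0; // rest of this line is irrelevant
                    }
                } else {
                    state = (b == '\n') ? 1 : 0;
                }
            }
        }
        reader.join();
        if (failure[0] != null) {
            throw failure[0];
        }
        return count;
    }
}
```

With 4 KiB pages (an assumption about the platform), the suggested starting point of 1024 pages corresponds to chunkSize = 1024 * 4096. Whether the second thread pays off depends on whether the search or the disk is the slower side, so it is worth measuring against the single-threaded version.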
