I have huge files (4.5 GB each) and need to count the number of lines in each file that start with a given token. There can be up to 200k occurrences of the token per file.
What would be the fastest way to traverse such a huge file and detect the token? Is there a more efficient approach than the following implementation using a Scanner and String.startsWith()?
public static int countOccurrences(File inputFile, String token) throws FileNotFoundException {
    int counter = 0;
    try (Scanner scanner = new Scanner(inputFile)) {
        while (scanner.hasNextLine()) {
            if (scanner.nextLine().startsWith(token)) {
                counter++;
            }
        }
    }
    return counter;
}
Note: Scanner is the bottleneck (i.e. if I add more complex processing than token detection and apply it to all lines, the overall execution time is more or less the same). Thanks in advance for your help.
A few pointers (the assumption is that the lines are relatively short and the data is really ASCII or similar):

read a huge buffer of bytes at a time (say 1/4 GB), then chop off the incomplete line at the end and prepend it to the next read.

search for bytes; do not waste time converting to chars

indicate "beginning of line" by starting your search pattern with '\n'; treat the first line specially

use a high-speed search algorithm that reduces search time at the expense of pre-processing (google for "fast substring search", e.g. Boyer-Moore)

if actual line numbers (rather than the lines) are needed, count the lines in a separate stage
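A minimal sketch putting the first three pointers together (bulk byte reads, no char conversion, '\n' marking line starts). The class and method names are mine; it assumes a non-empty, ASCII-compatible token. A small per-byte state machine is used instead of the prepend step, since match state simply carries over from one read to the next:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class TokenLineCounter {

    // Counts lines starting with token by scanning raw bytes only.
    // Assumes a non-empty token and ASCII-compatible data.
    public static int countOccurrences(File file, String token) throws IOException {
        byte[] pattern = token.getBytes(StandardCharsets.US_ASCII);
        byte[] buf = new byte[1 << 22]; // 4 MiB read buffer; tune empirically
        int count = 0;
        int matched = 0; // pattern bytes matched since line start; -1 = line ruled out
        try (InputStream in = new FileInputStream(file)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                for (int i = 0; i < n; i++) {
                    byte b = buf[i];
                    if (b == '\n') {
                        matched = 0;          // next byte begins a new line
                    } else if (matched >= 0) {
                        if (b == pattern[matched]) {
                            if (++matched == pattern.length) {
                                count++;      // line starts with the token
                                matched = -1; // ignore the rest of this line
                            }
                        } else {
                            matched = -1;     // this line can no longer match
                        }
                    }
                }
            }
        }
        return count;
    }
}
```

Because the state machine never looks backwards, chunk boundaries need no special handling, and the first line works out of the box (matched starts at 0).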
We can reduce the problem to searching for \n<token> in a byte stream. In that case, one quick way is to read a chunk of data sequentially from disk (the size is determined empirically, but a good starting point is 1024 pages) and hand that data to a different thread for processing.
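A sketch of that pipeline, assuming a non-empty ASCII token (the class name and the 4 MiB chunk size are my choices, not fixed by the answer): the calling thread reads sequential chunks and hands them over a bounded BlockingQueue to a worker thread, which runs a byte-level matcher whose state survives chunk boundaries:

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class PipelinedCounter {
    private static final byte[] POISON = new byte[0]; // end-of-input marker

    public static int count(File file, String token) throws Exception {
        byte[] pattern = token.getBytes(StandardCharsets.US_ASCII);
        // Bounded queue so the reader cannot run arbitrarily ahead of the worker.
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(8);

        // Worker thread: per-byte matcher; its state carries across chunks.
        int[] result = new int[1];
        Thread worker = new Thread(() -> {
            int count = 0;
            int matched = 0; // pattern bytes matched since line start; -1 = line ruled out
            try {
                byte[] chunk;
                while ((chunk = queue.take()) != POISON) {
                    for (byte b : chunk) {
                        if (b == '\n') {
                            matched = 0;
                        } else if (matched >= 0) {
                            if (b == pattern[matched]) {
                                if (++matched == pattern.length) {
                                    count++;
                                    matched = -1;
                                }
                            } else {
                                matched = -1;
                            }
                        }
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            result[0] = count;
        });
        worker.start();

        // Reader (this thread): sequential chunked reads handed to the worker.
        try (InputStream in = new FileInputStream(file)) {
            byte[] buf = new byte[1 << 22]; // 4 MiB chunks; size is empirical
            int n;
            while ((n = in.read(buf)) > 0) {
                byte[] chunk = new byte[n];
                System.arraycopy(buf, 0, chunk, 0, n);
                queue.put(chunk);
            }
        }
        queue.put(POISON);
        worker.join();
        return result[0];
    }
}
```

The point of the handoff is overlap: while the worker scans one chunk, the reader is already blocked in the next disk read, so I/O and matching proceed concurrently.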