
Find a string in a very large formatted text file in Java

Here is the thing: I have a really big text file and it has a format like this:

0007476|000011434982|00249626000|R|2008-01-11 00:00:00|9999-12-31 23:59:59|000019.99
0007476|000014017887|00313865000|R|2011-04-19 00:00:00|9999-12-31 23:59:59|000599.99
...
...

And I need to find if a particular pattern exists in the file, say

0007476|whatever|00313865000|whatever

All I need is a boolean saying yes or no. Now what I have done is to read the file line by line and do a regular expression matching:

Pattern pattern = Pattern.compile(regex);
// try-with-resources closes the scanner on every exit path
try (Scanner scanner = new Scanner(new File(fileName))) {
    while (scanner.hasNextLine()) {
        String line = scanner.nextLine();
        if (pattern.matcher(line).matches()) {
            return true;
        }
    }
}
return false;

and the regex, written as a Java string literal, has the form

"0007476\\|\\d{12}\\|0031386500.*"

This method works, but it usually takes about 15 seconds to find a string that is far from the start of the file. Is there a faster way to achieve this? Thanks

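The Java String class has a contains method that returns a boolean. If the strings you are looking for are fixed, this is typically faster than a regular expression. Note that contains matches anywhere in the line, so this check is looser than the anchored regex: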

if (string.contains("0007476|") && string.contains("|00313865000|")) {
   // whatever
}
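
A stricter variant of the same idea can be sketched with startsWith and indexOf, so that the first field is anchored at the start of the line and the second value is only looked for in the third field. This is a minimal sketch, not a measured solution: the class and method names are hypothetical, the file is assumed to be readable as UTF-8, and unlike the regex it does not verify that the second field consists of exactly 12 digits.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class FixedFieldSearch {

    // Returns true if the line starts with firstField followed by '|'
    // and its third pipe-delimited field starts with thirdFieldPrefix.
    // Simplification: the content of the second field is not validated.
    static boolean lineMatches(String line, String firstField, String thirdFieldPrefix) {
        if (!line.startsWith(firstField) || !line.startsWith("|", firstField.length())) {
            return false;
        }
        // Locate the pipe that ends the second field.
        int secondPipe = line.indexOf('|', firstField.length() + 1);
        return secondPipe >= 0 && line.startsWith(thirdFieldPrefix, secondPipe + 1);
    }

    // Scans the file line by line and stops at the first matching line.
    // Assumption: the file decodes as UTF-8 (the sample data is plain ASCII).
    static boolean fileContains(Path file, String firstField, String thirdFieldPrefix)
            throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(file)) {
            for (String line; (line = reader.readLine()) != null; ) {
                if (lineMatches(line, firstField, thirdFieldPrefix)) {
                    return true;
                }
            }
        }
        return false;
    }
}
```

For the example in the question, the call would look like fileContains(path, "0007476", "0031386500"). Whether this beats the regex on a given file would need to be measured.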

Hope that helped, if not, leave a comment.

I assume that you need the Scanner because the file is too big to read into a single String instead?

If that is not the case, you can probably use a regular expression that finds the match directly in the whole text. Depending on whether or not you care about the specific text at the start of the line, you can use something along the lines of:

"(?m)^0007476\\|\\d{12}\\|0031386500.*$"

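For the case where the whole file does fit in memory, the multi-line pattern above could be applied roughly as follows. This is only a sketch under assumptions: the class and method names are hypothetical, the file is assumed to be UTF-8, and the approach is only an option when the file fits comfortably in the heap, since the entire content is held as one String.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

public class WholeFileSearch {

    // (?m) lets ^ and $ match at the start and end of every line in the text.
    static final Pattern RECORD =
            Pattern.compile("(?m)^0007476\\|\\d{12}\\|0031386500.*$");

    // find() scans the whole text for the first line that matches the pattern.
    static boolean containsRecord(CharSequence text) {
        return RECORD.matcher(text).find();
    }

    // Loads the entire file into memory before searching.
    // Assumption: UTF-8 content that fits in the available heap.
    static boolean fileContainsRecord(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return containsRecord(text);
    }
}
```

Note that find() is used rather than matches(), because matches() would require the pattern to cover the entire text instead of a single line within it.
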
If you do need to break the file into smaller chunks because of memory usage, I would suggest not reading it line by line (the lines are rather short) but processing bigger chunks with something like a BufferedReader instead.


I fiddled around a bit with a 1.25GB file and the following is about 2.5 times faster than your implementation:

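import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.regex.Pattern;

private static boolean matches() throws IOException {
   // (?m) makes ^ and $ match at line boundaries inside each multi-line chunk
   String regex = "(?m)^0007476\\|\\d{12}\\|0031386500.*$";
   Pattern pattern = Pattern.compile(regex);

   // FILENAME is assumed to be a constant holding the path of the file to search
   try (BufferedReader br = new BufferedReader(new FileReader(FILENAME))) {
      // Match against chunks of up to 10000 lines instead of one line at a time
      for (String lines; (lines = readLines(br, 10000)) != null; ) {
         if (pattern.matcher(lines).find()) {
            return true;
         }
      }
   }

   return false;
}

// Reads up to `amount` lines and returns them as one string, or null at end of file
private static String readLines(BufferedReader br, int amount) throws IOException {
   StringBuilder builder = new StringBuilder();
   int lineCounter = 0;
   // Check the counter before reading so that no line is read and then discarded
   for (String line; lineCounter < amount && (line = br.readLine()) != null; lineCounter++) {
      builder.append(line).append(System.lineSeparator());
   }

   return lineCounter > 0 ? builder.toString() : null;
}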
