
Regex to search and replace text in a large file

I am searching for a multiline pattern in a huge file and, if it is found, need to replace the matched contents. I want to do this in a memory-efficient way. My current implementation reads the file in chunks of 4096 bytes, applies a regex search and replace to each chunk, and writes the result to an output stream. This gives me some memory improvement because the entire file is never loaded into memory, but I am doing a lot of I/O with the map/flush calls. I need suggestions on further improving my code. Also, the algorithm fails if the pattern being searched for is split across adjacent chunks. Any ideas on how to efficiently search and replace text that is split across adjacent chunks? Assumption: the text to search for is always less than 4096 bytes.

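import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.CharBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public void searchAndReplace(String inputFilePath, String outputFilePath) throws IOException {

    // Note: [^(</a>)]* is a character class (any char except '(', '<', '/', 'a', '>', ')'),
    // not "anything up to </a>", so link text containing the letter 'a' will not match.
    Pattern HEADER_PATTERN = Pattern.compile("<a [^>]*>[^(</a>)]*</a>", Pattern.DOTALL);
    Charset UTF8 = StandardCharsets.UTF_8;
    File outputFile = new File(outputFilePath);
    if (!outputFile.exists()) {
        outputFile.createNewFile();
    }

    // try-with-resources closes both streams even if an exception is thrown
    try (FileInputStream inputStream = new FileInputStream(new File(inputFilePath));
         FileOutputStream outputStream = new FileOutputStream(outputFile)) {

        FileChannel inputChannel = inputStream.getChannel();

        final long length = inputChannel.size();
        long pos = 0;
        while (pos < length) {
            // Take the minimum as a long first, so files over 2 GB do not overflow the int cast
            int remaining = (int) Math.min(4096, length - pos);
            MappedByteBuffer map = inputChannel.map(FileChannel.MapMode.READ_ONLY, pos, remaining);
            // Known limitation: a multi-byte UTF-8 character or a match that straddles
            // the 4096-byte boundary is not handled here
            CharBuffer cbuf = UTF8.newDecoder().decode(map);
            Matcher matcher = HEADER_PATTERN.matcher(cbuf);
            StringBuffer sb = new StringBuffer();
            while (matcher.find()) {
                matcher.appendReplacement(sb, "Some text");
            }
            matcher.appendTail(sb);
            // Encode with the same charset used for decoding, not the platform default
            outputStream.write(sb.toString().getBytes(UTF8));
            outputStream.flush();
            pos = pos + 4096;
        }
    }
}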

Declare a list of special characters that are unlikely to be in your string. Then test your string to ensure that one of the special characters doesn't exist inside it. Plant the special character between the areas you want to apply your regex to. Then you can do a find/replace or search with /[^¬]*myRegExHere[^\\¬]/g
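
For the chunk-boundary problem in the question, a different technique from the sentinel idea above is to keep a carry-over tail: read through a Reader (so character decoding is not tied to byte chunk boundaries), append each chunk to a small buffer, and only replace matches that start far enough from the end of the buffer that they cannot continue into the next chunk. The following is a minimal sketch of that idea, not a definitive implementation. The class name StreamingReplace, the method replaceAll, and the parameters chunkSize and maxMatch are hypothetical names chosen for illustration. It assumes every match is shorter than maxMatch characters (the question's "less than 4096 bytes" assumption), that the pattern never matches the empty string, and that it uses no end-of-input anchors, lookahead, or lookbehind.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class StreamingReplace {

    /**
     * Copies in to out, replacing each match of pattern with a literal replacement.
     * maxMatch is an assumed upper bound (in chars) on the length of any single match.
     */
    public static void replaceAll(Reader in, Writer out, Pattern pattern,
                                  String replacement, int chunkSize, int maxMatch)
            throws IOException {
        StringBuilder buf = new StringBuilder();   // carry-over tail plus the newest chunk
        char[] chunk = new char[chunkSize];
        boolean eof = false;
        while (!eof) {
            int n = in.read(chunk);
            if (n < 0) {
                eof = true;
            } else {
                buf.append(chunk, 0, n);
            }
            // A match starting before 'safe' must end inside buf, because it is
            // shorter than maxMatch. At end of input the whole buffer is safe.
            int safe = eof ? buf.length() : Math.max(0, buf.length() - maxMatch);
            Matcher m = pattern.matcher(buf);
            int last = 0;                          // end of the last replaced match
            while (m.find() && m.start() < safe) {
                out.append(buf, last, m.start());  // text before the match, unchanged
                out.write(replacement);
                last = m.end();
            }
            // Emit everything up to the safe point; keep the rest for the next round.
            int keepFrom = Math.max(last, safe);
            out.append(buf, last, keepFrom);
            buf.delete(0, keepFrom);
        }
        out.flush();
    }

    public static void main(String[] args) throws IOException {
        // Illustrative pattern and input; the link deliberately spans several chunks.
        Pattern p = Pattern.compile("<a [^>]*>.*?</a>", Pattern.DOTALL);
        String input = "aa <a href=\"x\">li\nnk</a> bb <a id='y'>z</a> cc";
        StringWriter out = new StringWriter();
        replaceAll(new StringReader(input), out, p, "Some text", 5, 32);
        System.out.println(out);   // prints: aa Some text bb Some text cc
    }
}
```

For real files, the Reader and Writer could come from Files.newBufferedReader and Files.newBufferedWriter with StandardCharsets.UTF_8, with chunkSize and maxMatch both set to 4096. The buffer never holds much more than chunkSize + maxMatch characters, so memory use stays bounded regardless of file size.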

