
Fastest/Most efficient way to parse a document, search for strings and replace them in document with Java

So I have been working on a Java program that scans and parses a number of files, replacing terms (such as func_123) with their readable names.

There are three files that provide definitions, so each file needs to be parsed thrice.

The program loads each definition into a class called Pair and puts that pair into an ArrayList.

Then the program goes through each file line by line and replaces any matched string, creating and running a new thread for each file.

So what would be the fastest/most efficient way to parse, replace and write these changes to the new file?

Below is what I have so far.

Code that parses through each file:

Thread thread = new Thread() {
    @Override
    public void run() {
        try {
            File temp = File.createTempFile("temp", "tmp");
            BufferedReader br = new BufferedReader(new FileReader(file));
            BufferedWriter bw = new BufferedWriter(new FileWriter(temp));
            String s = null;
            while ((s = br.readLine()) != null) {
                s = Deobfuscator2.deobfuscate(s);
                bw.write(s);
                bw.newLine();
            }
            bw.close();
            br.close();
            writeFromFileTo(temp, file);
            temp.delete();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
};

Code that decodes each string:

public static String deobfuscate(String s) {
    for (Pair<String, String> pair : fieldsMappings) {
        s = s.replaceAll(pair.key, pair.value);
    }
    for (Pair<String, String> pair : methodsMappings) {
        s = s.replaceAll(pair.key, pair.value);
    }
    for (Pair<String, String> pair : paramsMappings) {
        s = s.replaceAll(pair.key, pair.value);
    }
    return s;
}

Pair Class:

public static class Pair <K,V> {

    private K key;
    private V value;

    public Pair(K key, V value) {
        this.key = key;
        this.value = value;
    }

    public K getKey() {
        return key;
    }

    public V getValue() {
        return value;
    }

}

Helper function to copy contents from one file to another:

private void writeFromFileTo(File file1, File file2) throws IOException {
    BufferedReader br = new BufferedReader(new FileReader(file1));
    BufferedWriter bw = new BufferedWriter(new FileWriter(file2));

    String s = null;
    while ((s = br.readLine()) != null) {
        bw.write(s);
        bw.newLine();
    }
    bw.close();
    br.close();
}

I tried to be as clear as possible and give all the relevant code, but if you need/want anything else let me know.

My code works, but my problem is that it takes a while to run and can be pretty resource intensive (if I don't limit the threads) when there are a lot of files to parse. In total there are 33,000+ definitions (10,000+ per definitions file) that could potentially need replacing.

Repeatedly calling replaceAll is expensive: the regular expression is recompiled on every call, and each replacement creates a new String instance. A better approach is to precompile one regexp that matches any key, then iterate across the string and replace each key found with the corresponding value:

static Pattern pattern;
static List<String> replacements = new ArrayList<>();

static {
    StringBuilder sb = new StringBuilder();
    for (List<Pair<String, String>> mapping : Arrays.asList(
            fieldsMappings, methodsMappings, paramsMappings)) {
        for (Pair<String, String> pair : mapping) {
            sb.append("(");
            sb.append(pair.key);
            sb.append(")|");
            replacements.add(Matcher.quoteReplacement(pair.value));
        }
    }
    // Remove trailing "|" character in regexp.
    if (sb.length() > 0) {
        sb.setLength(sb.length() - 1);
    }
    pattern = Pattern.compile(sb.toString());
}

public static String deobfuscate(String s) {
    StringBuffer sb = new StringBuffer();
    Matcher matcher = pattern.matcher(s);
    while (matcher.find()) {
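        // Figure out which key matched and fetch the corresponding replacement.
        // Capturing groups are numbered from 1 (group 0 is the whole match), so
        // key i is group i + 1. This assumes keys contain no capturing groups.
        String replacement = null;
        for (int i = 0; i < replacements.size(); i++) {
            if (matcher.group(i + 1) != null) {
                replacement = replacements.get(i);
                break;
            }
        }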
        if (replacement == null) {
            // Should never happen.
            throw new RuntimeException("Regexp matched, but no group matched");
        }
        matcher.appendReplacement(sb, replacement);
    }
    matcher.appendTail(sb);
    return sb.toString();
}

The above code assumes that each key is a regexp. If the keys are instead fixed strings, there's no need to use regexp groups to identify which key matched; you can use a map instead. This would look like:

static Pattern pattern;
static Map<String, String> replacements = new HashMap<>();

static {
    StringBuilder sb = new StringBuilder();
    for (List<Pair<String, String>> mapping : Arrays.asList(
            fieldsMappings, methodsMappings, paramsMappings)) {
        for (Pair<String, String> pair : mapping) {
            sb.append(Pattern.quote(pair.key));
            sb.append("|");
            replacements.put(pair.key, Matcher.quoteReplacement(pair.value));
        }
    }
    // Remove trailing "|" character in regexp.
    if (sb.length() > 0) {
        sb.setLength(sb.length() - 1);
    }
    pattern = Pattern.compile(sb.toString());
}

public static String deobfuscate(String s) {
    StringBuffer sb = new StringBuffer();
    Matcher matcher = pattern.matcher(s);
    while (matcher.find()) {
        matcher.appendReplacement(sb, replacements.get(matcher.group()));
    }
    matcher.appendTail(sb);
    return sb.toString();
}

Note that replacements are quoted with Matcher.quoteReplacement when building the replacement list/map, to ensure replacements are treated literally, since regexp backreferences won't work anyway when building a composite regexp from all the keys. If you depend on backreferences in the replacements, this approach won't work.

Be warned that the code above hasn't been tested (or even compiled).
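
One thing to watch with the fixed-string variant: Java's regex alternation tries the alternatives left to right, so if one key is a prefix of another (say func_12 and func_123), the shorter key can win when it is listed first. Below is a minimal self-contained sketch of the same map-based idea that sorts the keys longest-first to avoid this. The class name LiteralReplacer and its constructor are made up for illustration and are not part of the original program. It still does not stop func_12 from matching inside an unmapped func_1299; wrapping the alternation in word boundaries would be one way to handle that, if your keys allow it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (name and shape are assumptions, not the asker's code).
final class LiteralReplacer {
    private final Pattern pattern;               // null when there are no keys
    private final Map<String, String> replacements;

    LiteralReplacer(Map<String, String> mappings) {
        this.replacements = new HashMap<>(mappings);
        // Longest keys first, so "func_123" is tried before its prefix "func_12".
        List<String> keys = new ArrayList<>(mappings.keySet());
        keys.sort((a, b) -> Integer.compare(b.length(), a.length()));
        StringBuilder sb = new StringBuilder();
        for (String key : keys) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append(Pattern.quote(key));       // treat every key literally
        }
        this.pattern = keys.isEmpty() ? null : Pattern.compile(sb.toString());
    }

    String deobfuscate(String s) {
        if (pattern == null) {
            return s;                            // nothing to replace
        }
        Matcher matcher = pattern.matcher(s);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String value = replacements.get(matcher.group());
            // quoteReplacement keeps '$' and '\' in the value literal.
            matcher.appendReplacement(out, Matcher.quoteReplacement(value));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

Build one LiteralReplacer from all three mapping lists up front and share it between the file threads; a compiled Pattern is safe to share, and each call creates its own Matcher.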

  1. String.replaceAll() is slow here, since the regex Pattern is recompiled for every key on every call. One idea is to cache the compiled Patterns instead of the key Strings and run the replacement through those. At the very least this will be much faster than the current version.

  2. Another possible idea is to optimize the examination of s with a prefix trie.

For example, suppose s looks like

'qqq aaa 111 bbb 222 ccc rgege'

and the keys are aaa, bbb and ccc. Then your current algorithm examines the characters of s three times. But if you examine the characters one by one, look them up in the prefix trie, and keep the indices of the matched positions along with their values, a single examination of s is enough to work out the following:

replace aaa with aaaValue at 4, replace bbb at 12, and replace ccc at 20.

This would probably also improve speed significantly. There are Java libraries for this, such as the concurrent-tree jar. If the performance is not as expected, there are practice trie implementations online, including ones built on primitive arrays, which should be about as fast as this approach gets.
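
To make the trie idea concrete, here is a rough hand-rolled sketch rather than the concurrent-tree API. The class TrieReplacer and its methods are invented for illustration. It uses a HashMap per node for brevity, where an array-based node would be faster, and it takes the longest key that matches at each position. Each starting position only walks the trie as far as some key prefix keeps matching, so the cost no longer grows with the number of keys the way the replaceAll loop does.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie-based replacer (illustrative only, not a library API).
final class TrieReplacer {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        String value;                            // non-null only where a key ends
    }

    private final Node root = new Node();

    void put(String key, String value) {
        if (key.isEmpty()) {
            throw new IllegalArgumentException("empty key");
        }
        Node node = root;
        for (int i = 0; i < key.length(); i++) {
            node = node.children.computeIfAbsent(key.charAt(i), c -> new Node());
        }
        node.value = value;
    }

    String replaceAll(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            Node node = root;
            String found = null;
            int end = i;
            // Walk the trie from position i, remembering the longest key seen.
            for (int j = i; j < s.length(); j++) {
                node = node.children.get(s.charAt(j));
                if (node == null) {
                    break;
                }
                if (node.value != null) {
                    found = node.value;
                    end = j + 1;
                }
            }
            if (found != null) {
                out.append(found);               // emit the replacement
                i = end;                         // and skip past the matched key
            } else {
                out.append(s.charAt(i));         // no key starts here
                i++;
            }
        }
        return out.toString();
    }
}
```

With the keys aaa, bbb and ccc mapped to aaaValue, bbbValue and cccValue, calling replaceAll on the example string above replaces the keys found at indices 4, 12 and 20.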
