
Java regex performance for long regular expressions

I want to check whether a set of strings contains a set of words.

String[] text = new String[10000];
text[0] = "John was killed in London";
text[1] = "Joe was murdered in New York";
....

String regex = "killed | killing | dead |murdered | beheaded | kidnapped | arrested | apprehended .....

I have a long list of words separated by the OR operator, as shown above, and I want to check whether each sentence contains at least one word from the list.

I know how to use Pattern and Matcher.

What I want to know is which of the following methods performs better:

  1. having a long list of words separated by the OR operator in a single regex
  2. having multiple regexes (by dividing the list into 2 or 3 or ?) and doing the matching in separate steps

Or, is there any other way to do this faster?

I think the fastest way to do this is to put all the words in a set (such as a HashSet or a TreeSet). Then process each line, split it into words, and check for each word whether it is in the set. With a HashSet each lookup takes O(1) time on average. With a TreeSet each lookup is O(log n), where n is the number of words in the set. Another alternative is a trie: put the keywords into the trie and check each word of the line against it. If case is irrelevant, store the keywords in uppercase and convert each word to uppercase before checking.
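A minimal sketch of the HashSet idea, assuming every keyword is a single word and that splitting on non-letter characters is an acceptable way to tokenize. The class name KeywordSetMatcher is made up for illustration, and it lowercases rather than uppercases, which works the same way.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration, not part of the original answer.
final class KeywordSetMatcher {

    private final Set<String> keywords = new HashSet<>();

    KeywordSetMatcher(Iterable<String> words) {
        for (String w : words) {
            // Store a normalised form so that lookups ignore case.
            keywords.add(w.toLowerCase(Locale.ROOT));
        }
    }

    /** Returns true if at least one token of the sentence is a keyword. */
    boolean containsAny(String sentence) {
        // Assumption: tokens are runs of letters; everything else is a separator.
        for (String token : sentence.split("[^\\p{L}]+")) {
            if (keywords.contains(token.toLowerCase(Locale.ROOT))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        KeywordSetMatcher m =
                new KeywordSetMatcher(Arrays.asList("killed", "murdered", "kidnapped"));
        System.out.println(m.containsAny("John was killed in London")); // true
        System.out.println(m.containsAny("No match test!"));            // false
    }
}
```

The cost per sentence is proportional to the number of tokens, independent of how many keywords there are. Multi-word phrases such as "shot dead" would not be found this way.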

Since a regex in Java is compiled into an internal data structure, 1) multiple regexes are not a good option, and 2) one regex with a long list of alternatives is also not a good option because of the compilation time.

It would be preferable to use a data structure for this, such as a list or a HashMap.
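The trie mentioned in the earlier answer is one such data structure. Below is a rough sketch, assuming non-empty keywords and case-sensitive substring matching (so, like String.contains, "skilled" would match the keyword "killed"). KeywordTrie and its methods are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical trie for illustration; assumes non-empty keywords.
final class KeywordTrie {

    private static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean terminal; // true if a keyword ends at this node
    }

    private final Node root = new Node();

    KeywordTrie(Iterable<String> words) {
        for (String w : words) {
            Node n = root;
            for (int i = 0; i < w.length(); i++) {
                n = n.next.computeIfAbsent(w.charAt(i), c -> new Node());
            }
            n.terminal = true;
        }
    }

    /** Returns true if any keyword occurs in the text as a substring. */
    boolean containsAny(String text) {
        for (int start = 0; start < text.length(); start++) {
            Node n = root;
            // Walk the trie from this start position until no edge matches.
            for (int i = start; i < text.length(); i++) {
                n = n.next.get(text.charAt(i));
                if (n == null) {
                    break;
                }
                if (n.terminal) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        KeywordTrie t = new KeywordTrie(Arrays.asList("killed", "murdered", "kidnapped"));
        System.out.println(t.containsAny("Michael was kidnapped in York")); // true
        System.out.println(t.containsAny("No match test!"));                // false
    }
}
```

All keywords are checked in a single pass per start position, so the work per sentence does not grow with the number of keywords, only with the sentence length and the longest keyword.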

If you have a lot of phrases and many keywords, it might be better to parallelize the matching instead of using a regex. On a multi-core machine this can be much faster than running a regex in a loop on a single thread.

First you need a processing class, whose instances are submitted to individual worker threads:

final class StringMatchFinder implements Runnable {

    private final String text;
    private final Collection<Match> results;

    public StringMatchFinder(final String text, final Collection<Match> results) {
        this.text = text;
        this.results = results;
    }

    @Override
    public void run() {
        // keywords is a field of the enclosing Processor class (see the complete example)
        for (final String keyword : keywords) {
            if (text.contains(keyword)) {
                results.add(new Match(text, keyword));
            }
        }
    }
}

Now you need an ExecutorService:

final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

And then process the phrases:

public void processText(List<String> texts) {
    final Collection<Match> results = new ConcurrentLinkedQueue<Match>();
    final Collection<Future<?>> futures = new LinkedList<Future<?>>();
    for (final String text : texts) {
        futures.add(es.submit(new StringMatchFinder(text, results)));
    }
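    // once shut down, the pool accepts no new tasks, so this method can only run once
    es.shutdown();
    try {
        es.awaitTermination(1, TimeUnit.DAYS);
    } catch (InterruptedException e) {
        // restore the interrupt flag instead of swallowing the interruption
        Thread.currentThread().interrupt();
    }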

    for (final Match match : results) {
        System.out.println(match.getOriginalText() + " ; keyword found:" + match.getKeyword());
        //or write them to a file
    }
}

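The futures are collected so that they can be checked for processing errors, for example by calling get() on each one. The code shown here does not actually iterate over them. Results are saved in a thread-safe queue of matches.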
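A possible way to check the collected futures for errors is sketched below. FutureErrorCheck and collectFailures are hypothetical names, not part of the answer's code. The sketch relies on the fact that Future.get() rethrows a task's exception wrapped in an ExecutionException.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper for illustration.
final class FutureErrorCheck {

    /** Waits for every task and returns the exceptions thrown by failed tasks. */
    static List<Throwable> collectFailures(List<Future<?>> futures) {
        List<Throwable> failures = new ArrayList<>();
        for (Future<?> f : futures) {
            try {
                f.get(); // blocks until the task has finished
            } catch (ExecutionException e) {
                failures.add(e.getCause()); // the exception thrown inside run()
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop waiting
                failures.add(e);
                break;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        ExecutorService es = Executors.newFixedThreadPool(2);
        List<Future<?>> futures = new ArrayList<>();
        Runnable ok = () -> { };
        Runnable failing = () -> { throw new IllegalStateException("boom"); };
        futures.add(es.submit(ok));
        futures.add(es.submit(failing));
        es.shutdown();
        System.out.println(collectFailures(futures).size() + " task(s) failed"); // 1 task(s) failed
    }
}
```

Because get() blocks until each task completes, a loop like this could also stand in for the awaitTermination call when waiting for the results.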


Here is a complete example.

The class Match

public class Match {
    private String originalText;
    private String keyword;

    public Match(String originalText, String keyword) {
        this.originalText = originalText;
        this.keyword = keyword;
    }

    public void setOriginalText(String originalText) {
        this.originalText = originalText;
    }

    public String getOriginalText() {
        return originalText;
    }

    public void setKeyword(String keyword) {
        this.keyword = keyword;
    }

    public String getKeyword() {
        return keyword;
    }
}

The Processor class

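import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class Processor {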
    final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    private Collection<String> keywords;

    public Processor(Collection<String> keywords) {
        this.keywords = keywords;
    }

    final class StringMatchFinder implements Runnable {

        private final String text;
        private final Collection<Match> results;

        public StringMatchFinder(final String text, final Collection<Match> results) {
            this.text = text;
            this.results = results;
        }

        @Override
        public void run() {
            for (final String keyword : keywords) {
                if (text.contains(keyword)) {
                    results.add(new Match(text, keyword));
                }
            }
        }
    }

    public void processText(List<String> texts) {
        final Collection<Match> results = new ConcurrentLinkedQueue<Match>();
        final Collection<Future<?>> futures = new LinkedList<Future<?>>();
        for (final String text : texts) {
            futures.add(es.submit(new StringMatchFinder(text, results)));
        }
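        // once shut down, the pool accepts no new tasks, so this method can only run once
        es.shutdown();
        try {
            es.awaitTermination(1, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            // restore the interrupt flag instead of swallowing the interruption
            Thread.currentThread().interrupt();
        }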

        for (final Match match : results) {
            System.out.println(match.getOriginalText() + " ; keyword found:" + match.getKeyword());
        }
    }
}

A main class for testing

import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<String> texts = new ArrayList<String>();
        List<String> keywords = new ArrayList<String>();

        texts.add("John was killed in London");
        texts.add("No match test!");
        texts.add("Joe was murdered in New York");
        texts.add("Michael was kidnapped in York");
        //add more

        keywords.add("murdered");
        keywords.add("killed");
        keywords.add("kidnapped");

        Processor pp = new Processor(keywords);
        pp.processText(texts);
    }
}

To understand the performance of this, you need to understand how regular expressions work. They are more sophisticated than Java's String.contains, which can degrade to roughly quadratic time in the worst case and has to be called once per keyword. A regular expression is compiled into a graph that is traversed as the input string is consumed. That means that if you have multiple words, a well-crafted regex (for example one that factors out common prefixes, written by hand or produced by a regex optimiser such as https://www.dcode.fr/regular-expression-simplificator ) can perform much better. I'm not sure whether Java optimises your regex out of the box. java.util.regex is a backtracking engine, so it is worth measuring. Here is a visual example of a correctly compiled regex graph.

[Image: enter image description here]
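For the single-regex approach, a sketch of building the alternation from a word list and compiling it once might look like the following. KeywordRegex is a hypothetical name, and the choices of word boundaries and case-insensitive matching are assumptions, not something the question specifies. Note that in the question's pattern the spaces around each | are literal characters, so "killed " would only match when followed by a space.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper for illustration.
final class KeywordRegex {

    /** Builds \b(?:w1|w2|...)\b with every word quoted, ignoring case. */
    static Pattern compile(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote) // treat each keyword as a literal
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE);
    }

    public static void main(String[] args) {
        // Compile once, then reuse the same Pattern for every sentence.
        Pattern p = compile(Arrays.asList("killed", "murdered", "kidnapped"));
        List<String> texts = Arrays.asList(
                "John was killed in London",
                "No match test!",
                "Joe was murdered in New York");
        for (String text : texts) {
            System.out.println(p.matcher(text).find() + " : " + text);
        }
    }
}
```

Compiling inside the loop would repeat the compilation cost for every sentence, so the Pattern is created once up front. Whether this beats the set-based or threaded approaches for a given word list is something to benchmark rather than assume.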


 