
Java regex performance for long regular expressions

I want to check whether a set of strings contains a set of words.

String[] text = new String[10000];
text[0] = "John was killed in London";
text[1] = "Joe was murdered in New York";
....

String regex = "killed | killing | dead |murdered | beheaded | kidnapped | arrested | apprehended .....

I have a long list of words separated by the OR operator, as shown above, and I want to check whether each sentence contains at least one word from the list.

I know how to use Pattern and Matcher.

What I want to know is which of the following methods performs better:

  1. having a long list of words separated by the OR operator in a single regex
  2. having multiple regexes (dividing the list into 2 or 3 or ?) and doing the matching in separate steps

Or is there any other way to do this faster?

I think the fastest way to do this is to put all the words in a set (such as a HashSet or a TreeSet), then process each line and check for each word whether it is in the set. With a HashSet each lookup takes O(1) time on average. With a TreeSet each lookup is O(log n), where n is the number of words in the set. Another alternative is a trie: put the keywords into a trie and check each word of the line against it. If case is irrelevant, store the keywords in uppercase and convert each word to uppercase before checking.
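A minimal sketch of the HashSet approach described above. The class name KeywordSetMatcher and the tokenization rule (splitting on runs of non-letter characters) are assumptions made for illustration; multi-word phrases such as "New York" would need different handling.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper: stores keywords in uppercase and tests sentence tokens against them.
final class KeywordSetMatcher {
    private final Set<String> keywords = new HashSet<>();

    KeywordSetMatcher(String... words) {
        for (String w : words) {
            // Normalize once at construction time so lookups are case-insensitive.
            keywords.add(w.toUpperCase(Locale.ROOT));
        }
    }

    /** Returns true if at least one token of the sentence is a keyword. */
    boolean containsKeyword(String sentence) {
        // Assumption: words are separated by anything that is not a letter.
        for (String token : sentence.split("[^\\p{L}]+")) {
            if (!token.isEmpty() && keywords.contains(token.toUpperCase(Locale.ROOT))) {
                return true; // average O(1) lookup per token
            }
        }
        return false;
    }

    public static void main(String[] args) {
        KeywordSetMatcher m = new KeywordSetMatcher("killed", "murdered", "kidnapped");
        System.out.println(m.containsKeyword("John was killed in London")); // true
        System.out.println(m.containsKeyword("No match test!"));            // false
    }
}
```

The cost per sentence is proportional to the number of tokens, independent of how many keywords are in the set.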

Because a regex in Java is compiled into an internal data structure, 1) multiple regexes are not a good option, and 2) one regex with a long list of alternatives is also not a good option, because of compilation time.

It would be preferable to use a data structure for this list, such as a list or a HashMap.
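If a regex is used anyway, the compilation cost mentioned above is paid only once when the Pattern is built up front and reused for every sentence. The following is a sketch under that assumption; the class and method names are hypothetical, and whole-word matching via \b is a choice made here, not something stated in the question.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

// Hypothetical helper: builds one alternation Pattern and reuses it.
final class CompiledAlternation {

    /** Compiles a single whole-word alternation from the given keywords. */
    static Pattern build(List<String> words) {
        String alternation = words.stream()
                .map(Pattern::quote) // treat each keyword as a literal
                .collect(Collectors.joining("|"));
        return Pattern.compile("\\b(?:" + alternation + ")\\b");
    }

    /** Returns true if the sentence contains at least one keyword. */
    static boolean containsAny(Pattern compiled, String sentence) {
        return compiled.matcher(sentence).find();
    }

    public static void main(String[] args) {
        // Compile once, outside the loop over the sentences.
        Pattern p = build(Arrays.asList("killed", "murdered", "kidnapped"));
        String[] texts = {"John was killed in London", "No match test!"};
        for (String text : texts) {
            System.out.println(text + " -> " + containsAny(p, text));
        }
    }
}
```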

If you have a lot of phrases and many keywords, it might be better to parallelize the matching instead of using regex. This is indeed much faster than using a regex in a loop on the same processor.

First you need a processing class, which is submitted to the individual worker threads:

final class StringMatchFinder implements Runnable {

    private final String text;
    private final Collection<Match> results;

    public StringMatchFinder(final String text, final Collection<Match> results) {
        this.text = text;
        this.results = results;
    }

    @Override
    public void run() {
        // keywords is a field of the enclosing Processor class (see the complete example)
        for (final String keyword : keywords) {
            if (text.contains(keyword)) {
                results.add(new Match(text, keyword));
            }
        }
    }
}

Now you need an ExecutorService:

final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());

And then process the phrases:

public void processText(List<String> texts) {
    final Collection<Match> results = new ConcurrentLinkedQueue<Match>();
    final Collection<Future<?>> futures = new LinkedList<Future<?>>();
    for (final String text : texts) {
        futures.add(es.submit(new StringMatchFinder(text, results)));
    }
    es.shutdown();
    try {
        es.awaitTermination(1, TimeUnit.DAYS);
    } catch (InterruptedException e) {
        e.printStackTrace();
    }

    for (final Match match : results) {
        System.out.println(match.getOriginalText() + " ; keyword found:" + match.getKeyword());
        //or write them to a file
    }
}

The futures are collected so that processing errors can be checked (for example by calling get() on each one, which the code above leaves out). Results are saved in a collection of matches.
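A possible way to check the futures for processing errors is sketched here. The class FutureErrorCheck and the method countFailures are hypothetical names, not part of the answer's code; the sketch relies only on the standard behavior that Future.get() throws ExecutionException when the task threw.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical helper: inspects each Future to find tasks that threw.
final class FutureErrorCheck {

    /** Waits for every task and returns how many of them failed with an exception. */
    static int countFailures(List<Future<?>> futures) {
        int failures = 0;
        for (Future<?> future : futures) {
            try {
                future.get(); // blocks until the task is done
            } catch (ExecutionException e) {
                failures++;
                System.err.println("Task failed: " + e.getCause());
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop waiting
                break;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        ExecutorService es = Executors.newFixedThreadPool(2);
        List<Future<?>> futures = new ArrayList<>();
        Runnable ok = () -> { };
        Runnable failing = () -> { throw new IllegalStateException("boom"); };
        futures.add(es.submit(ok));
        futures.add(es.submit(failing));
        es.shutdown();
        System.out.println("failed tasks: " + countFailures(futures));
    }
}
```

Because get() already waits for each task, a loop like this could also stand in for the awaitTermination call when all submitted futures are tracked.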


Here is a complete example.

The class Match

public class Match {
    private String originalText;
    private String keyword;

    public Match(String originalText, String keyword) {
        this.originalText = originalText;
        this.keyword = keyword;
    }

    public void setOriginalText(String originalText) {
        this.originalText = originalText;
    }

    public String getOriginalText() {
        return originalText;
    }

    public void setKeyword(String keyword) {
        this.keyword = keyword;
    }

    public String getKeyword() {
        return keyword;
    }
}

The class Processor

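import java.util.Collection;
import java.util.LinkedList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class Processor {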
    final ExecutorService es = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    private Collection<String> keywords;

    public Processor(Collection<String> keywords) {
        this.keywords = keywords;
    }

    final class StringMatchFinder implements Runnable {

        private final String text;
        private final Collection<Match> results;

        public StringMatchFinder(final String text, final Collection<Match> results) {
            this.text = text;
            this.results = results;
        }

        @Override
        public void run() {
            for (final String keyword : keywords) {
                if (text.contains(keyword)) {
                    results.add(new Match(text, keyword));
                }
            }
        }
    }

    public void processText(List<String> texts) {
        final Collection<Match> results = new ConcurrentLinkedQueue<Match>();
        final Collection<Future<?>> futures = new LinkedList<Future<?>>();
        for (final String text : texts) {
            futures.add(es.submit(new StringMatchFinder(text, results)));
        }
        es.shutdown();
        try {
            es.awaitTermination(1, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        }

        for (final Match match : results) {
            System.out.println(match.getOriginalText() + " ; keyword found:" + match.getKeyword());
        }
    }
}

A main class for testing

import java.util.ArrayList;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        List<String> texts = new ArrayList<String>();
        List<String> keywords = new ArrayList<String>();

        texts.add("John was killed in London");
        texts.add("No match test!");
        texts.add("Joe was murdered in New York");
        texts.add("Michael was kidnapped in York");
        //add more

        keywords.add("murdered");
        keywords.add("killed");
        keywords.add("kidnapped");

        Processor pp = new Processor(keywords);
        pp.processText(texts);
    }
}

To understand the performance of this, you need to understand how regular expressions work. They are much more sophisticated than Java's String.contains, which can have quadratic performance with respect to the string in the worst case. Regular expressions compile down to a graph which you traverse with every character from the input string. That means that if you have multiple words, you can get much better performance if you craft your regex correctly or use a regex optimiser (e.g. https://www.dcode.fr/regular-expression-simplificator ). I'm not sure whether Java optimises your regex out of the box. Here is a visual example of a correctly compiled regex graph.

[image: compiled regex graph]
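As an illustration of what "crafting" the regex could mean, the sketch below writes the same three keywords once as a plain alternation and once with the shared prefixes factored out by hand. The class name and the sample keywords are assumptions; the code only shows that both patterns accept the same inputs and makes no claim about how much faster, if at all, the factored form runs on a given JVM.

```java
import java.util.regex.Pattern;

// Hypothetical comparison of a plain alternation and a hand-factored equivalent.
final class FactoredRegex {
    // Plain alternation of whole words.
    static final Pattern PLAIN =
            Pattern.compile("\\b(?:killed|killing|kidnapped)\\b");

    // Same language, with the shared prefixes "k" and "kill" written only once.
    static final Pattern FACTORED =
            Pattern.compile("\\bk(?:ill(?:ed|ing)|idnapped)\\b");

    public static void main(String[] args) {
        String[] samples = {
            "John was killed in London",
            "Michael was kidnapped in York",
            "No match test!"
        };
        for (String s : samples) {
            boolean plain = PLAIN.matcher(s).find();
            boolean factored = FACTORED.matcher(s).find();
            System.out.println(s + " -> plain=" + plain + ", factored=" + factored);
        }
    }
}
```

Measuring both forms on real data is the only reliable way to tell whether the factoring pays off.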
