
Regular Expressions Java, why is this regex so slow?

I just created a regular expression in Java. I want to look for expressions in about 5000 tweets, but each tweet takes almost one second. Why is it so slow?

Is the expression too complex, or is there something in it that is too expensive to execute? I'd hope to process the whole data set in less than 5 seconds.

The code is:

import java.io.File;
import java.io.IOException;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import org.apache.commons.io.FileUtils; // Apache Commons IO

public class RegularExpression {
    public static void main(String[] args) throws IOException {
        String filter = ".*\"created_at\":\"(.*?)\".*\"content\":\"(.*?word.*?)\",\"id\".*";
        Pattern pattern = Pattern.compile(filter);
        // One tweet (a JSON object) per line
        List<String> tweets = FileUtils.readLines(new File("/tmp/tweets"));

        System.out.println("Start with " + tweets.size());
        int i = 0;
        for (String t : tweets) {
            Matcher matcher = pattern.matcher(t);
            matcher.find();
            System.out.println(i++);
        }
        System.out.println("End");
    }
}

The input is JSON tweets. If I make my RE simpler it runs faster, but I don't think my RE is that heavy. I'd like to understand why this is happening; I was just running a test.

UPDATED:

The reason I'm using RE rather than parsing the JSON is that, in the end, my input could be simple text, XML, JSON, or a log from any kind of server. So I have to treat my input as plain text.

Your regex is very imprecise about what it allows to match. Most importantly, you seem to want to match text between quotes, but you're allowing quote characters to be part of the match (.* can and will happily match "). This sets you up for a potentially very high number of permutations that a regex engine has to check before declaring failure or success, depending on your input.
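To make the point about .* matching quotes concrete, here is a minimal sketch. The class name, the helper firstGroup, and the two tiny input strings are made up for illustration and are not from the question. It shows that both the greedy and the lazy form are free to run across quote characters:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotStarCrossesQuotes {
    // Returns group 1 of the first match, or null if the pattern does not match.
    static String firstGroup(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Greedy .* runs to the last quote on the line, not the next one.
        System.out.println(firstGroup("\"(.*)\"", "\"a\",\"b\""));
        // Lazy .*? still walks across quotes until "word" shows up.
        System.out.println(firstGroup("\"(.*?word.*?)\"", "\"foo\",\"a word\""));
    }
}
```

In both cases the captured group contains quote characters from a neighbouring value, which is probably not what the pattern was meant to capture. It also means the engine has many more positions to try before it can give up on a line.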

If quotes cannot in fact be part of the text that you're currently matching with .*, then use [^"]* instead; that should speed it up a lot:

"[^\"]*\"created_at\":\"([^\"]*)\"[^\"]*\"content\":\"([^\"]*word[^\"]*)\",\"id\"[^\"]*"
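As a rough sketch of how that pattern could be used, the example below compiles it once and pulls out the two groups. The class name, the extract helper, and the sample line are assumptions for illustration; the sample places "content" directly after "created_at", which is what this pattern requires.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NegatedClassExtract {
    // The negated-class pattern, compiled once and reused for every line.
    static final Pattern TWEET = Pattern.compile(
        "[^\"]*\"created_at\":\"([^\"]*)\"[^\"]*\"content\":\"([^\"]*word[^\"]*)\",\"id\"[^\"]*");

    // Returns {created_at, content}, or null when the line does not match.
    static String[] extract(String line) {
        Matcher m = TWEET.matcher(line);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        // Hypothetical sample line; real tweets carry many more fields.
        String line = "{\"created_at\":\"Mon Sep 24 03:35:21 +0000 2012\","
                    + "\"content\":\"a word here\",\"id\":1}";
        String[] fields = extract(line);
        System.out.println(fields[0] + " | " + fields[1]);
    }
}
```

Because [^"]* can never step over a quote, a value containing an escaped quote (\" inside the JSON string) or any extra field between "created_at" and "content" would make the line fail to match. Whether that is acceptable depends on your data.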

Since you already know that your input is JSON, you should not use regular expressions to interpret it. Use a JSON parser; then you don't have to care about things like escaped special characters.

I'm not entirely sure why it takes almost a full second to process a single tweet, but lazy quantifiers are more expensive than a "match anything except" approach in a "match until" scenario.

More information here: http://blog.stevenlevithan.com/archives/greedy-lazy-performance

You could try avoiding lazy quantifiers, or just use a JSON parser instead, as it would likely be faster and cleaner.
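If you want to check this on your own data, a crude timing harness along these lines may help. It is only a sketch: the class name, the helpers, and the synthetic lines are assumptions, the lines are much shorter than real tweets, and there is no JIT warm-up, so treat the numbers as a rough indication rather than a benchmark.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RegexTiming {
    // Pattern from the question (.* and lazy quantifiers).
    static final Pattern SLOW = Pattern.compile(
        ".*\"created_at\":\"(.*?)\".*\"content\":\"(.*?word.*?)\",\"id\".*");
    // Negated-class variant suggested in the answers.
    static final Pattern FAST = Pattern.compile(
        "[^\"]*\"created_at\":\"([^\"]*)\"[^\"]*\"content\":\"([^\"]*word[^\"]*)\",\"id\"[^\"]*");

    // Builds synthetic lines (not real tweets); every third line contains "word".
    static List<String> sampleLines(int n) {
        List<String> lines = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            String text = (i % 3 == 0) ? "a word here" : "nothing to see";
            lines.add("{\"created_at\":\"Mon Sep 24 03:35:21 +0000 2012\",\"content\":\""
                    + text + "\",\"id\":" + i + ",\"user\":\"someone\",\"lang\":\"en\"}");
        }
        return lines;
    }

    // Counts the lines on which the pattern finds a match.
    static int countMatches(Pattern p, List<String> lines) {
        int hits = 0;
        for (String line : lines) {
            if (p.matcher(line).find()) {
                hits++;
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> lines = sampleLines(300);
        for (Pattern p : new Pattern[] { SLOW, FAST }) {
            long start = System.nanoTime();
            int hits = countMatches(p, lines);
            long ms = (System.nanoTime() - start) / 1_000_000;
            System.out.println(hits + " matches in " + ms + " ms: " + p.pattern());
        }
    }
}
```

Swapping sampleLines for the lines of your real file should show how much of the per-tweet cost comes from the pattern itself; lines that do not match are usually where the .* version loses the most time.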
