
Regular expressions are extremely slow

Pre: I'm trying to extract different types of parts from a big array using regular expressions. The operation runs in an AsyncTask. part.plainname is a string of at most 256 characters. item_pattern looks like "^keyword.*?$".

Problem: I found the method that slows everything down:

public int defineItemAmount(NFItem[] parts, String item_pattern){
    System.out.println("STAMP2");
    int casecount = 0;
    for (NFItem part : parts) {
        if (testItem(part.plainname, item_pattern))
            ++casecount;
    }
    System.out.println("STAMP3");
    return casecount;
}

public boolean testItem(String testString, String item_pattern){
    Pattern p = Pattern.compile(item_pattern);
    Matcher m = p.matcher(testString);
    return m.matches();
}

There are only 950 parts, but it runs horribly slowly:

02-25 11:34:51.773    1324-1343/com.nfe.unsert.dns_pc_creator I/System.out﹕ STAMP2

02-25 11:35:18.094    1324-1343/com.nfe.unsert.dns_pc_creator I/System.out﹕ STAMP3

That is more than 20 seconds just for the counting (about 26, going by the timestamps above). testItem is called a lot, roughly 15 × the number of parts, so the whole app runs for more than 15 minutes. Almost the same Java program (not an Android app) finishes in less than 30 seconds.

Question: What am I doing wrong? Why does a simple regexp operation take so long?

You can pre-compile the pattern:

public static int defineItemAmount(NFItem[] parts, String item_pattern){
    System.out.println("STAMP2");
    Pattern pattern = Pattern.compile(item_pattern);
    int casecount = 0;
    for (NFItem part : parts) {
        if (testItem(part.plainname, pattern))
            ++casecount;
    }
    System.out.println("STAMP3");
    return casecount;
}

public static boolean testItem(String testString, Pattern pattern){
    Matcher m = pattern.matcher(testString);
    return m.matches();
}

If you are looking for a string that begins with a keyword, you don't need the matches method with a pattern like ^keyword.*?$ :

  • first, the non-greedy quantifier is useless and may slow down the regex engine for nothing; a greedy quantifier gives you the same result.
  • since the matches method is anchored at both ends by default, the ^ and $ anchors are not needed and you can remove them.
  • you are only interested in the beginning of the string, so in this case the lookingAt method is more appropriate, since it doesn't care what happens at the end of the string.
  • as other answers note, if the same pattern is used several times, compile it once and for all outside the testItem function. If it isn't reused, there is no point in compiling it ahead of time.
  • if keyword is a literal string and not a subpattern, don't use regex at all; use indexOf to check whether the keyword is at index 0.
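The points in the list above can be sketched roughly as follows. This is only an illustration, not the asker's code: the class PrefixCheck, its method names, and the sample strings are hypothetical. The second helper uses String.startsWith rather than indexOf(keyword) == 0, because startsWith only compares the prefix, whereas indexOf keeps scanning the rest of the string when the keyword is not at the start.

```java
import java.util.regex.Pattern;

class PrefixCheck {
    // Hypothetical helper: counts names that begin with a match of a precompiled pattern.
    static int countWithLookingAt(String[] names, Pattern prefix) {
        int count = 0;
        for (String name : names) {
            // lookingAt() is anchored at the start only, so no ^, $ or trailing .* is needed
            if (prefix.matcher(name).lookingAt()) {
                count++;
            }
        }
        return count;
    }

    // Hypothetical helper: for a literal keyword, skip regex entirely.
    static int countWithStartsWith(String[] names, String keyword) {
        int count = 0;
        for (String name : names) {
            if (name.startsWith(keyword)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Made-up sample data, not taken from the question
        String[] names = {"cpu intel i5", "cpu amd ryzen", "gpu nvidia", "my cpu"};
        Pattern prefix = Pattern.compile("cpu"); // compiled once, outside the loop
        System.out.println(countWithLookingAt(names, prefix));
        System.out.println(countWithStartsWith(names, "cpu"));
    }
}
```

With this sample data both helpers count the same two names, since only the first two strings begin with "cpu".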

You don't need to compile the pattern each time. Instead, do it once at initialisation.
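When the pattern arrives as a string parameter, as it does in testItem, one possible way to compile it only once is to cache the compiled Pattern by its source string. The sketch below assumes exactly that; the class PatternCache and its get method are hypothetical names, not part of the question. Pattern instances are immutable and safe to share between threads, which matters here because the work runs in an AsyncTask.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.regex.Pattern;

class PatternCache {
    // Hypothetical cache: maps each regex string to its compiled Pattern.
    private static final ConcurrentMap<String, Pattern> CACHE = new ConcurrentHashMap<>();

    // Returns the compiled Pattern, compiling only the first time a regex is seen.
    static Pattern get(String regex) {
        Pattern p = CACHE.get(regex);
        if (p == null) {
            Pattern compiled = Pattern.compile(regex);
            // putIfAbsent returns the existing value if another thread stored one first
            Pattern previous = CACHE.putIfAbsent(regex, compiled);
            p = (previous != null) ? previous : compiled;
        }
        return p;
    }

    // Same signature as the original testItem, but without recompiling on every call.
    static boolean testItem(String testString, String itemPattern) {
        return get(itemPattern).matcher(testString).matches();
    }

    public static void main(String[] args) {
        System.out.println(testItem("keyword and more", "keyword.*"));
        System.out.println(get("keyword.*") == get("keyword.*")); // same cached instance
    }
}
```

Callers keep passing the pattern string, so existing call sites would not need to change; whether the cache pays off depends on how many distinct patterns are actually in use.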

But, due to their generality, regular expressions are not fast, and they are not designed to be. You might be better off using a specific string-splitting technique if the data are sufficiently regular.
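As an illustration of the string-splitting idea, suppose (and this is an assumption the question does not confirm) that every plainname starts with its type keyword followed by a space. Then a single pass could tally all types at once instead of running each pattern over the whole array. PrefixTally and countByFirstToken are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class PrefixTally {
    // Hypothetical helper: counts names by their first space-delimited token in one pass.
    static Map<String, Integer> countByFirstToken(String[] names) {
        Map<String, Integer> counts = new HashMap<>();
        for (String name : names) {
            int space = name.indexOf(' ');
            // Use the whole name as the key when there is no delimiter
            String key = (space < 0) ? name : name.substring(0, space);
            Integer old = counts.get(key);
            counts.put(key, old == null ? 1 : old + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample data; the real name format is an assumption
        String[] names = {"cpu intel i5", "cpu amd ryzen", "gpu nvidia", "ram"};
        System.out.println(countByFirstToken(names).get("cpu"));
    }
}
```

If the real names do not follow such a fixed layout, this approach does not apply and the precompiled-pattern route remains the safer choice.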

  1. Regexes are usually slow because a lot of things (such as synchronization) are involved in constructing them.

  2. Don't call a separate method in the loop, since that might prevent certain optimizations. Let the VM optimize the for loop. Try this and check the performance:

      Pattern p = Pattern.compile(item_pattern); // compile the pattern only once
      int casecount = 0;
      for (NFItem part : parts) {
          // match inline instead of calling testItem, which would recompile the pattern
          Matcher m = p.matcher(part.plainname);
          if (m.matches())
              ++casecount;
      }
      ...


 