
What is a regular expression for detecting a for loop and a while loop in Java code?

What is a regular expression for detecting a for loop, and another one for detecting a while loop? I want to detect for(--;--;--) and while (--comparison operator--) constructs.

You can't do this reliably with a regex. You need to parse the code with a proper parser.

You folks who are using \\s in Java to detect whitespace in Java code are making at least one and maybe several mistakes.

First of all, the Java compiler's own idea of whitespace doesn't line up with what \\s matches in Java. You can get at Java's Character.isWhitespace() through the \\p{javaWhitespace} property.

Secondly, Java does not allow \\s to match Unicode whitespace; as implemented in the Java Pattern class, \\s only matches ASCII whitespace. In fact, Java does not support any property that corresponds to Unicode whitespace. (This describes Java 6 and earlier; Java 7 added the Pattern.UNICODE_CHARACTER_CLASS flag and the \\p{IsWhite_Space} property.)

Here's a table showing some of the problem areas:

                      000A    0085    00A0    2029
                      J  P    J  P    J  P    J  P
                \s    1  1    0  1    0  1    0  1
               \pZ    0  0    0  0    1  1    1  1
            \p{Zs}    0  0    0  0    1  1    0  0
         \p{Space}    1  1    0  1    0  1    0  1
         \p{Blank}    0  0    0  0    0  1    0  0
    \p{Whitespace}    -  1    -  1    -  1    -  1
\p{javaWhitespace}    1  -    0  -    0  -    1  -
 \p{javaSpaceChar}    0  -    0  -    1  -    1  -

What you're looking at on the x-axis is four different code points:

U+000A: LINE FEED (LF)
U+0085: NEXT LINE (NEL)
U+00A0: NO-BREAK SPACE
U+2029: PARAGRAPH SEPARATOR

The y-axis has eight different regex tests, mostly properties. For each of those code points, there is both a J-results column for Java and a P-results column for Perl or any other PCRE-based regex engine. A 1 means the test matches the code point, a 0 means it does not, and a - means the engine does not support that test.
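The Java column of the table can be reproduced with a few lines of code. The following is only a sketch: the class and method names (WhitespaceTable, row) are made up for illustration, and it checks just the regex tests that Java accepts.

```java
import java.util.regex.Pattern;

class WhitespaceTable {
    // The four code points from the table: LF, NEL, NO-BREAK SPACE, PARAGRAPH SEPARATOR.
    static final int[] CODE_POINTS = {0x000A, 0x0085, 0x00A0, 0x2029};

    /** Returns one table row for the Java column, e.g. "1000", for the given regex. */
    static String row(String regex) {
        StringBuilder sb = new StringBuilder();
        for (int cp : CODE_POINTS) {
            String oneChar = new String(Character.toChars(cp));
            sb.append(Pattern.matches(regex, oneChar) ? '1' : '0');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] tests = {"\\s", "\\pZ", "\\p{Zs}", "\\p{Space}", "\\p{Blank}", "\\p{javaWhitespace}"};
        for (String regex : tests) {
            System.out.println(regex + "  " + row(regex));
        }
    }
}
```

Run with default flags, \\s only recognizes the line feed among these four code points, while \\p{javaWhitespace} also accepts the paragraph separator.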

It's a big problem. Java is just messed up, giving answers that are "wrong" according to existing practice and also according to Unicode. Plus Java doesn't even give you access to the real Unicode properties. For the record, these are the code points with the Unicode whitespace property:

% unichars '\p{Whitespace}'
0009 CHARACTER TABULATION
000A LINE FEED (LF)
000B LINE TABULATION
000C FORM FEED (FF)
000D CARRIAGE RETURN (CR)
0020 SPACE
0085 NEXT LINE (NEL)
00A0 NO-BREAK SPACE
1680 OGHAM SPACE MARK
180E MONGOLIAN VOWEL SEPARATOR
2000 EN QUAD
2001 EM QUAD
2002 EN SPACE
2003 EM SPACE
2004 THREE-PER-EM SPACE
2005 FOUR-PER-EM SPACE
2006 SIX-PER-EM SPACE
2007 FIGURE SPACE
2008 PUNCTUATION SPACE
2009 THIN SPACE
200A HAIR SPACE
2028 LINE SEPARATOR
2029 PARAGRAPH SEPARATOR
202F NARROW NO-BREAK SPACE
205F MEDIUM MATHEMATICAL SPACE
3000 IDEOGRAPHIC SPACE

If you want, feel free to grab the unichars program and play around with it and its companion programs, uniprops and uninames. I haven't added the Java-only properties yet, but I intend to. There are just too many nasty surprises like those described above.

For kicks and grins, would you believe there's a \\p{javaJavaIdentifierStart} property in Java? I kid you not. But you wouldn't believe the characters the compiler actually lets you use in identifiers; really you wouldn't. Somebody wasn't paying attention. Again. :(
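To see what that property lets through, here is a small sketch that asks both the regex engine and Character.isJavaIdentifierStart() about a few characters. The class name and the sample characters are chosen purely for illustration.

```java
import java.util.regex.Pattern;

class IdentifierStartDemo {
    /** True if the regex property accepts the code point as the start of an identifier. */
    static boolean regexSaysStart(int codePoint) {
        String oneChar = new String(Character.toChars(codePoint));
        return Pattern.matches("\\p{javaJavaIdentifierStart}", oneChar);
    }

    public static void main(String[] args) {
        int[] samples = {'a', '_', '$', 0x20AC /* EURO SIGN */, '1'};
        for (int cp : samples) {
            System.out.printf("U+%04X regex=%b Character=%b%n",
                    cp, regexSaysStart(cp), Character.isJavaIdentifierStart(cp));
        }
    }
}
```

Currency symbols such as the euro sign are accepted as identifier starts, which is one of the surprises alluded to above.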

You can parse almost anything with modern (PCRE-style) regex. However, parsing certain things correctly is often pathologically difficult. It's easy to build a small, terse regex to match only certain kinds of simply formatted for loops:

for\s*\([^;]*?;[^;]*?;[^)]*?\)

But what happens when you run into something like this?

int i = 0;
for(
        String s = "for(0;1;2)";
        s.indexOf(String.valueOf(i)) != -1;
        i++ // increment the i variable ;-)
   )
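A quick way to see the problem is to run the terse pattern over that snippet. The sketch below (class and method names are illustrative) returns whatever the pattern claims is the loop header.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NaiveForRegexDemo {
    // The terse pattern from above, written as a Java string literal.
    static final Pattern FOR_LOOP = Pattern.compile("for\\s*\\([^;]*?;[^;]*?;[^)]*?\\)");

    // The awkward loop from above.
    static final String CODE =
            "int i = 0;\n"
          + "for(\n"
          + "        String s = \"for(0;1;2)\";\n"
          + "        s.indexOf(String.valueOf(i)) != -1;\n"
          + "        i++ // increment the i variable ;-)\n"
          + "   )\n";

    /** Returns the first span the pattern reports as a for-loop header, or null. */
    static String firstMatch(String code) {
        Matcher m = FOR_LOOP.matcher(code);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(CODE));
    }
}
```

The pattern does report a match, but the match stops at the closing parenthesis inside the string literal, so the reported "header" ends in the middle of the first clause of the real loop.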

Better to use a full-blown purpose-built Java parser if you need 100% reliability. The java.net article "Source Code Analysis Using Java 6 APIs" gives a jumping-off point for one way to do reliable parsing of Java source code.
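As a rough illustration of the parser route, the sketch below uses the JDK's own Compiler Tree API (javax.tools plus com.sun.source) to count for and while loops. It is not taken from the article mentioned above; the LoopCounter and StringSource names are made up, and it assumes a full JDK is available, since ToolProvider.getSystemJavaCompiler() returns null on a bare JRE.

```java
import com.sun.source.tree.CompilationUnitTree;
import com.sun.source.tree.ForLoopTree;
import com.sun.source.tree.WhileLoopTree;
import com.sun.source.util.JavacTask;
import com.sun.source.util.TreeScanner;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.util.Collections;
import javax.tools.JavaCompiler;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class LoopCounter {
    /** Hypothetical helper: a source file held in memory. */
    static final class StringSource extends SimpleJavaFileObject {
        private final String code;

        StringSource(String name, String code) {
            super(URI.create("string:///" + name), Kind.SOURCE);
            this.code = code;
        }

        @Override
        public CharSequence getCharContent(boolean ignoreEncodingErrors) {
            return code;
        }
    }

    /** Returns {number of for loops, number of while loops} in a complete compilation unit. */
    static int[] countLoops(String source) {
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler(); // requires a JDK
        JavacTask task = (JavacTask) compiler.getTask(null, null, null, null, null,
                Collections.singletonList(new StringSource("Demo.java", source)));
        final int[] counts = new int[2];
        TreeScanner<Void, Void> scanner = new TreeScanner<Void, Void>() {
            @Override
            public Void visitForLoop(ForLoopTree node, Void p) {
                counts[0]++;
                return super.visitForLoop(node, p); // keep descending into nested loops
            }

            @Override
            public Void visitWhileLoop(WhileLoopTree node, Void p) {
                counts[1]++;
                return super.visitWhileLoop(node, p);
            }
        };
        try {
            // parse() only builds syntax trees; no type checking is performed.
            for (CompilationUnitTree unit : task.parse()) {
                scanner.scan(unit, null);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return counts;
    }
}
```

Because the parser tokenizes string literals and comments properly, the awkward loop shown earlier is counted exactly once. Enhanced for loops (for (x : xs)) are a separate node type, EnhancedForLoopTree, and are not counted by this sketch.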


In reply to Taz's comment:

I did it with .*for(.*;.*;.*).* what could be wrong with this?

Assuming all the for-loops you want to match have:

  1. no linebreaks in them,
  2. no embedded/trailing comments,
  3. no "string" or 'c'haracter literals in them,

I think your pattern should be OK. You might want to allow for whitespace between the for and the opening parenthesis:

.*for\s*(.*;.*;.*).*

However, as tchrist points out in his answer to this question, \\s* is not a perfectly correct way to allow for whitespace in Java source code, as Java source code supports types of Unicode whitespace that \\s does not allow for. Again, if you need 100% reliability, a full Java source code parser is probably a better choice.

Make sure you turn off (or don't turn on) the "dot matches newline" option in your regex engine (e.g. DOTALL or Singleline). Otherwise your regex could match across multiple lines, which is likely to cause it to match incorrectly.
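The effect of the "dot matches newline" option can be seen with a short sketch. It uses a variant of the pattern with the parentheses escaped (an assumption on my part, so that they match literal parentheses), and the sample code and class name are invented for the demonstration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DotAllDemo {
    static final String CODE =
            "for (int i = 0; i < n; i++) {\n"
          + "  x();\n"
          + "}\n"
          + "foo(bar);\n";

    /** Returns the first match of the for-loop pattern compiled with the given flags. */
    static String firstMatch(int flags) {
        Matcher m = Pattern.compile("for\\s*\\(.*;.*;.*\\)", flags).matcher(CODE);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println("default: " + firstMatch(0));
        System.out.println("DOTALL:  " + firstMatch(Pattern.DOTALL));
    }
}
```

With default flags the match stays on the first line. With Pattern.DOTALL the greedy .* runs on through the loop body and ends at the closing parenthesis of the unrelated foo(bar) call.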

for ?\(.*?;.*?;.*?\)
while ?\(.+?\)

If the code's going to be anything seriously complicated (anything beyond asking "does this loop occur anywhere in the code?"), use a parser instead.

Why do we need these ? here? And I do need to detect that there is a comparison operator in the while loop.

If I were to leave the ? out, then it would match for ( for(this;that;theother)

I updated the while loop to use +

I think that the regular expressions given by JV contain an extra question mark.
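The difference the ? makes after each .* (lazy versus greedy matching) shows up as soon as a line contains a second closing parenthesis. A minimal sketch, with an invented sample line and class name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LazyVsGreedyDemo {
    static final String LINE = "for (i = 0; i < n; i++) total += f(i);";

    /** Returns the first match of the given regex in LINE, or null. */
    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(LINE);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println("lazy:   " + firstMatch("for ?\\(.*?;.*?;.*?\\)"));
        System.out.println("greedy: " + firstMatch("for ?\\(.*;.*;.*\\)"));
    }
}
```

The lazy version stops at the first closing parenthesis after the second semicolon, while the greedy version runs on to the last closing parenthesis on the line.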

Here is my version:

for\s*\([^;]*;[^;]*;[^)]*\)

while\\s*\\(.*?\\) is correct, but

while\\s*\\([^)]*\\) should be faster.
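Both variants stop at the first closing parenthesis, which matters when the condition contains a method call. The sketch below (sample line and class name invented) shows what each one captures.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WhileRegexDemo {
    static final String LINE = "while (list.size() > 0) {";

    /** Returns the first match of the given regex in LINE, or null. */
    static String firstMatch(String regex) {
        Matcher m = Pattern.compile(regex).matcher(LINE);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch("while\\s*\\(.*?\\)"));
        System.out.println(firstMatch("while\\s*\\([^)]*\\)"));
    }
}
```

Both patterns report "while (list.size()" here, cutting the condition off at the parenthesis that closes the method call, so either is only safe for conditions without nested parentheses.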

For loops are the easiest to detect:

for *\(.*;.*;.*\)

While loops are a little trickier, as there are two ways to do it. If you want to use the format you specify above, this should work:

while *\(.*(<|>|<=|>=|==|!=).*\)

However, this does not detect while conditions that depend on the boolean value of a variable, nor on the boolean result of a method, so this version would be a little simpler and match more:

while *\(.*\)
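To make the trade-off concrete, here is a small sketch that runs the comparison-operator pattern over a few sample lines. The samples and the class name are invented for illustration.

```java
import java.util.regex.Pattern;

class WhileComparisonDemo {
    // The comparison-operator pattern from above, as a Java string literal.
    static final Pattern WITH_OPERATOR =
            Pattern.compile("while *\\(.*(<|>|<=|>=|==|!=).*\\)");

    /** True if the pattern finds a while loop with a comparison operator in the line. */
    static boolean detects(String line) {
        return WITH_OPERATOR.matcher(line).find();
    }

    public static void main(String[] args) {
        String[] lines = {
            "while (i < 10) {",            // comparison operator
            "while (running) {",           // boolean variable
            "while (it.hasNext()) {",      // boolean method result
            "while (s.equals(\"<\")) {"    // operator character inside a string literal
        };
        for (String line : lines) {
            System.out.println(detects(line) + "  " + line);
        }
    }
}
```

The pattern misses the boolean variable and method-call conditions, as described above, and it also reports the last line as a comparison even though the < only appears inside a string literal.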

Regular expressions can only parse regular (Chomsky type-3) languages. Java is not a regular language; it is at least context-free (type 2), maybe even context-sensitive (type 1).
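One concrete consequence: finding the parenthesis that closes a loop header means tracking arbitrary nesting depth, which needs a counter rather than a fixed pattern. The sketch below is only an illustration of that idea (the names are made up), and it deliberately ignores string literals and comments, which a real parser would also have to handle.

```java
class ParenMatcher {
    /**
     * Returns the index of the ')' that closes the '(' at openIndex, or -1 if there is none.
     * Assumption: the text contains no parentheses inside string literals or comments.
     */
    static int matchingParen(String text, int openIndex) {
        int depth = 0;
        for (int i = openIndex; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '(') {
                depth++;
            } else if (c == ')') {
                depth--;
                if (depth == 0) {
                    return i; // back at the nesting level where we started
                }
            }
        }
        return -1; // unbalanced input
    }

    public static void main(String[] args) {
        String line = "for (int i = f(0); i < g(n); i++) body()";
        int open = line.indexOf('(');
        int close = matchingParen(line, open);
        System.out.println(line.substring(open, close + 1));
    }
}
```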
