
Find Java comments (multi-line and single-line) using regex

I found the following regex online at http://regexlib.com/

(\/\*(\s*|.*?)*\*\/)|(\/\/.*)

It seems to work well for the following matches:

// Compute the exam average score for the midterm exam

/**
* The HelloWorld program implements an application that
*/

But it also matches part of

http://regexr.com/foo.html?q=bar

starting at the //.

I'm new to regex. I read that putting a caret at the beginning forces the match to start at the beginning of the line, but this doesn't seem to work on RegExr.

I'm using the following:

^(\/\*(\s*|.*?)*\*\/)|(\/\/.*)$
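The behaviour described above can be reproduced with a short check. This is only a sketch that assumes Python's re module (RegExr runs a different engine, but alternation precedence works the same way in the common engines). It suggests that in the anchored attempt the ^ belongs only to the first alternative and the $ only to the second, so the URL still matches. The `grouped` variant is a hypothetical illustration of anchoring both branches, not a fix for the overall problem.

```python
import re

url = "http://regexr.com/foo.html?q=bar"

# The anchored attempt from the question. Alternation has the lowest
# precedence, so this reads as (^ + block comment) OR (line comment + $).
anchored = re.compile(r"^(\/\*(\s*|.*?)*\*\/)|(\/\/.*)$")
hit = anchored.search(url)
print(hit.group(0))  # //regexr.com/foo.html?q=bar

# Hypothetical variant: a non-capturing group makes ^ and $ apply to both branches.
grouped = re.compile(r"^(?:(\/\*(\s*|.*?)*\*\/)|(\/\/.*))$")
print(grouped.search(url))  # None
print(grouped.search("// Compute the exam average score").group(0))
```

Anchoring both branches rejects the URL, but it would also reject a comment that follows code on the same line, so anchoring alone does not solve the problem.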

What you need is a regex that lets the comment opener ( // or /* ) appear anywhere except inside a token that can legitimately contain those characters. If you look at the lexical structure of the Java language, you'll see that the only lexical element (other than a comment itself) that can contain a // or a /* is the string literal. So, to avoid matching a comment opener inside a string, the regex has to consume every complete string literal that comes before the comment; otherwise the match could start in the middle of a string literal that happens to contain // or /*.

So the text before your comment must not contain an unterminated string literal. It can contain any number of complete string literals, separated by text that contains no quote. A string literal has this general shape:

\"()*\"

The parentheses must be filled with something that matches either an escape sequence such as \\ or \" , or any single character that is not a newline ( \n ), a bare " or a bare \ . (A Unicode escape such as \u0022 is translated into " before the compiler tokenizes the source, so strictly speaking it also ends the string; that rare case is ignored here.) This leads to

\"([^\\\"\n]|\\.)*\"

Each string literal is optional and is preceded by one character that is neither a " nor a \ , and the whole unit can be repeated any number of times:

([^\\"](\"([^\\\"\n]|\\.)*\")?)*
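The string-literal pattern and the repeated prefix built from it can be tried in isolation. The sketch below assumes Python's re module, and the names `STRING_LITERAL`, `PREFIX` and `is_string_literal` are made up for illustration; the two patterns are copied from the text above.

```python
import re

# String-literal pattern from the text above.
STRING_LITERAL = re.compile(r'\"([^\\\"\n]|\\.)*\"')

# Prefix pattern: one non-quote, non-backslash character, optionally
# followed by a complete string literal, repeated any number of times.
PREFIX = re.compile(r'([^\\"](\"([^\\\"\n]|\\.)*\")?)*')

def is_string_literal(text):
    """True if the whole text is one string literal according to the pattern."""
    return STRING_LITERAL.fullmatch(text) is not None

print(is_string_literal('"http://regexr.com/foo.html?q=bar"'))
print(is_string_literal(r'"say \"hi\""'))
print(is_string_literal('"unterminated'))
print(PREFIX.fullmatch('String u = "http://x"; ') is not None)
print(PREFIX.fullmatch('String u = "unterminated') is not None)
```

The prefix accepts text whose string literals are all closed and rejects text that ends inside an open literal, which is what keeps a later // or /* inside a string from being treated as a comment opener.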

The text before the comment must be matched by this expression. Then comes the comment itself, which takes one of two forms:

\/\/[^\n]*$

or

/\*([^\*]|\*[^\/])*\*\/

(that is: a slash, an escaped asterisk, then any number of items that are either a character other than * or a * followed by a character other than / , and finally the closing */ sequence)
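A quick check of the block-comment form, again assuming Python's re module. `BLOCK` is the pattern given above. `BLOCK_ALT` is a commonly used alternative that is not part of this answer; it is included only to show one limitation of the simpler form, which does not match a comment whose closing */ is directly preceded by another asterisk.

```python
import re

# Block-comment pattern from the text above.
BLOCK = re.compile(r"/\*([^\*]|\*[^\/])*\*\/")

# Alternative form (an assumption, not from the answer): after the opener,
# consume non-stars, then one or more stars, and repeat until a star run
# is immediately followed by the closing slash.
BLOCK_ALT = re.compile(r"\/\*[^*]*\*+(?:[^\/*][^*]*\*+)*\/")

javadoc = "/**\n* The HelloWorld program implements an application that\n*/"

print(BLOCK.fullmatch("/* plain */") is not None)
print(BLOCK.fullmatch(javadoc) is not None)
print(BLOCK.fullmatch("/* a **/") is not None)      # the simpler form misses this
print(BLOCK_ALT.fullmatch("/* a **/") is not None)
```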

These can be grouped in an alternative group, as in:

(\/\/[^\n]*|\/\*([^\*]|\*[^\/])*\*\/)

Finally, the complete expression is:

^([^\\"](\"([^\\\"\n]|\\.)*\")?)*(\/\/[^\n]*|\/\*([^\*]|\*[^/])*\*\/)

Note that the comment does not start at the beginning of the match: it is captured by the 4th group (the one opened by the 4th left parenthesis). The regex as a whole must match from the beginning of the input, see demo
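To see the 4th group in action, the complete expression can be run against a few lines. This is a minimal sketch assuming Python's re module; `FULL` and `captured_comment` are hypothetical names, and the pattern is the one shown above.

```python
import re

# Complete expression from the text above; the comment is capture group 4.
FULL = re.compile(
    r'^([^\\"](\"([^\\\"\n]|\\.)*\")?)*(\/\/[^\n]*|\/\*([^\*]|\*[^/])*\*\/)'
)

def captured_comment(text):
    """Return the comment captured by group 4, or None if nothing matches."""
    m = FULL.match(text)
    return m.group(4) if m else None

print(captured_comment('String u = "http://regexr.com/foo.html?q=bar"; // real comment'))
print(captured_comment('String u = "http://regexr.com/foo.html?q=bar";'))
print(captured_comment('a(); /* one */ b(); /* two */'))
```

The URL inside the string literal is skipped, and a line without a real comment gives no match. When the input holds several comments, the greedy leading part swallows the earlier ones and group 4 ends up holding the last one.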

Note

Keep in mind that you are matching not only the comment but also the text before it, so the overall match consists of that leading text followed by the comment. Also, if you try this regexp on input with several comments in sequence, it will capture only the last one, since the leading part is greedy and treats earlier comments as ordinary text. The case of a /* ... /* ... */ sequence is not covered either (a comment opener can also appear inside a comment, but handling that case as well will make you hate regexps forever). The correct way to cope with this problem is to write a lex/flex specification that recognizes the Java tokens, so that comments come out as tokens of their own, but this is beyond the scope of this explanation. See a probably valid example here.
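The tokenizer idea can be approximated without lex/flex by a small hand-written scanner. The sketch below is not a lex specification and not a full Java lexer: `find_comments` is a hypothetical helper, written in Python for brevity, that assumes plain string and char literals only and ignores text blocks and Unicode escapes.

```python
def find_comments(src):
    """Return the comments found outside string and char literals."""
    comments = []
    i, n = 0, len(src)
    while i < n:
        ch = src[i]
        if ch in "\"'":
            # Skip a string or char literal, honoring backslash escapes.
            quote = ch
            i += 1
            while i < n and src[i] != quote and src[i] != "\n":
                i += 2 if src[i] == "\\" else 1
            i += 1  # step past the closing quote (or the newline)
        elif src.startswith("//", i):
            # Line comment: runs to the end of the line.
            end = src.find("\n", i)
            end = n if end == -1 else end
            comments.append(src[i:end])
            i = end
        elif src.startswith("/*", i):
            # Block comment: runs to the first closing */ after the opener.
            end = src.find("*/", i + 2)
            end = n if end == -1 else end + 2
            comments.append(src[i:end])
            i = end
        else:
            i += 1
    return comments

sample = 'a("http://x"); /* one */ b(); /** two **/ // tail'
print(find_comments(sample))
```

Because the scanner consumes one token at a time, it reports every comment in order rather than only the last one, and a /* inside an open comment is simply part of that comment.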

You can try this pattern:

(?ms)^[^'"\n]*?(?:(?:"(?:\\.|[^"])*"|'\\?.')[^'"\n]*?)*((?:(?://[^\n]*|/\*.*?\*/)[ \t]*)+)

This captures comments in group 1, but only if the comment is not inside a string. Demo.


Breakdown:

(?ms)                 multiline flag, makes ^ match at the start of a line
                      singleline flag makes . match newlines
^                     start of line
[^'"\n]*?             match anything but " or ' or newline
(?:                   then, any number of strings:
    (?:
        "             start with a quote...
        (?:           ...followed by any number of...
            \\.       ...a backslash and the escaped character
        |             or
            [^"]      any character other than "
        )*
        "             ...and finally the closing quote
    |                 or...
        '\\?.'        a single character in single quotes, possibly escaped
    )
    [^'"\n]*?         and everything up to the next string or newline
)*
(                     finally, capture (any number of) comments:
    (?:
        (?:           either...
            //[^\n]*  a single line comment
        |             or
            /\*.*?\*/ a multiline comment
        )
        [ \t]*        and any subsequent comments if only separated by whitespace
    )+
)
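
The pattern can be exercised on a small multi-line snippet. This is a sketch assuming Python's re module, which accepts the same inline (?ms) flags; `PATTERN`, `source` and `comments` are names made up for the example.

```python
import re

# Pattern from this answer; comments are captured in group 1.
PATTERN = re.compile(
    r'''(?ms)^[^'"\n]*?(?:(?:"(?:\\.|[^"])*"|'\\?.')[^'"\n]*?)*((?:(?://[^\n]*|/\*.*?\*/)[ \t]*)+)'''
)

source = "\n".join([
    'String url = "http://regexr.com/foo.html?q=bar";',
    'int x = 1; // trailing comment',
    '/* block',
    '   comment */',
    "char c = '\"'; // after char literal",
])

comments = [m.group(1) for m in PATTERN.finditer(source)]
for comment in comments:
    print(repr(comment))
```

The URL inside the string on the first line is skipped, and the quote inside the char literal on the last line does not confuse the match.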
