
Java Regex find Oracle Single Line comments Except in a String

Find Oracle single line comments except the ones that appear inside a string.

For example:

-- This is a valid single line comment

But the following is a string, not a comment:

'This is a string -- and it is not a comment';

I am using this regex to find single line comments:

--.*$

It handles a few cases, but it fails on several more complex ones. You can use this script for reference:

-- this is a single line comment

CREATE OR REPLACE PROCEDURE "MAIL_WITH_ATTACHMENT" ( ) 
IS    
tmp varchar(2) ; -- this is a comment 
tmp1 varchar(2) := 'some texxt'; -- this is another comment
tmp2 varchar(3) := 'some more --text'; -- this is one more comment
tmp3 varchar(4) := 'this regex isn't --working properly'; -- Don't you think this is another comment
BEGIN

          '--This is a Mime message, which your current mail reader may not' || crlf ||
          ' some more -- characters in a string';

    mesg:= crlf ||
          '--This is a Mime message, which your current mail reader may not' || crlf ||
      ' some more -- characters in a string';
END; 

The result must be:

[1] : -- this is a single line comment
[2] : -- this is a comment 
[3] : -- this is another comment
[4] : -- this is one more comment
[5] : -- Don't you think this is another comment

Thanks

Personally, I'd use an SQL parser to strip these comments. The problem with regex is that it's not really aware of its surroundings: regex has a hard time figuring out if a single quote is inside a comment, or if -- is inside a string literal.
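
To make the "aware of its surroundings" point concrete, here is a minimal sketch of a hand-written scanner that tracks whether it is currently inside a string literal. It is not a real SQL parser: the names CommentScanner and extractComments are made up for illustration, and it assumes only single-quoted literals with '' as the escaped quote, so block comments, q'[...]' literals and double-quoted identifiers are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not from the original answer: a tiny hand-written scanner.
class CommentScanner {

    // Returns every "--" comment that starts outside a single-quoted string literal.
    static List<String> extractComments(String sql) {
        List<String> comments = new ArrayList<>();
        boolean inString = false;
        int i = 0;
        while (i < sql.length()) {
            char c = sql.charAt(i);
            if (inString) {
                if (c == '\'') {
                    if (i + 1 < sql.length() && sql.charAt(i + 1) == '\'') {
                        i += 2; // '' is an escaped quote, so we stay inside the literal
                        continue;
                    }
                    inString = false; // closing quote
                }
                i++;
            } else if (c == '\'') {
                inString = true; // opening quote
                i++;
            } else if (c == '-' && i + 1 < sql.length() && sql.charAt(i + 1) == '-') {
                // A comment runs to the end of the line (or the end of the input).
                int end = sql.indexOf('\n', i);
                if (end < 0) {
                    end = sql.length();
                }
                comments.add(sql.substring(i, end));
                i = end;
            } else {
                i++;
            }
        }
        return comments;
    }

    public static void main(String[] args) {
        String sql = "tmp2 varchar(3) := 'some more --text'; -- this is one more comment\n"
                   + "tmp3 varchar(4) := 'this regex isn''t --working properly'; -- Don't you think\n";
        for (String comment : extractComments(sql)) {
            System.out.println(comment);
        }
    }
}
```

Because the scanner consumes a whole comment in one step, an apostrophe inside the comment (as in "Don't") never flips the string state.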

You can work around regex's lack of context by using a pattern that matches from the start of each line and also matches string literals. That makes it behave more like a lexical analyzer (the first stage of parsing).

Such a regex could look like this:

(?m)^((?:(?!--|').|'(?:''|[^'])*')*)--.*$

A quick break down of the regex:

(?m)                 # enable multi-line mode
^                    # match the start of the line
(                    # start match group 1
  (?:                #   start non-capturing group 1
    (?!--|').        #     if there's no '--' or single quote ahead, match any char (except a line break)
    |                #     OR
    '(?:''|[^'])*'   #     match a string literal
  )*                 #   end non-capturing group 1 and repeat it zero or more times
)                    # end match group 1
--.*$                # match a comment all the way to the end of the line

In plain English that would read like: from each start of a line, try to match zero or more:

  • string literals ( '(?:''|[^'])*' );
  • or any character as long as it's not a single quote, a line break char or the start of a -- comment ( (?!--|'). ).

and store this match in group 1. Then match a comment ( --.*$ ).
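
As a rough illustration of that description, the sketch below runs the pattern on single lines and prints what lands in each group. The second capture group around the comment is an addition for this example only, and GroupDemo and split are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // The pattern from above, plus a second group around the comment (added for illustration).
    static final Pattern LINE =
        Pattern.compile("(?m)^((?:(?!--|').|'(?:''|[^'])*')*)(--.*)$");

    // Returns {code, comment} for the first line that ends in a comment, or null if none matches.
    static String[] split(String text) {
        Matcher m = LINE.matcher(text);
        return m.find() ? new String[] { m.group(1), m.group(2) } : null;
    }

    public static void main(String[] args) {
        String[] parts = split("tmp2 varchar(3) := 'some more --text'; -- this is one more comment");
        System.out.println("code:    [" + parts[0] + "]");
        System.out.println("comment: [" + parts[1] + "]");

        // A line whose only "--" sits inside a string literal does not match at all.
        System.out.println(split("' some more -- characters in a string';") == null);
    }
}
```

For the first line, group 1 holds everything up to and including the space after the semicolon, and the -- inside the literal is skipped because the whole literal is consumed as one unit.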

So now all you need to do is replace this pattern with whatever is matched in group 1. A demo:

String sql = "-- this is a single line comment\n" +
             "\n" +
             "CREATE OR REPLACE PROCEDURE \"MAIL_WITH_ATTACHMENT\" ( ) \n" +
             "IS    \n" +
             "tmp varchar(2) ; -- this is a comment \n" +
             "tmp1 varchar(2) := 'some texxt'; -- this is another comment\n" +
             "tmp2 varchar(3) := 'some more --text'; -- this is one more comment\n" +
             "tmp3 varchar(4) := 'this regex isn''t --working properly'; -- Don't you think this is another comment\n" +
             "BEGIN\n" +
             "\n" +
             "          '--This is a Mime message, which your current mail reader may not' || crlf ||\n" +
             "          ' some more -- characters in a string';\n" +
             "\n" +
             "    mesg:= crlf ||\n" +
             "          '--This is a Mime message, which your current mail reader may not' || crlf ||\n" +
             "      ' some more -- characters in a string';\n" +
             "END; ";
String stripped = sql.replaceAll("(?m)^((?:(?!--|').|'(?:''|[^'])*')*)--.*$", "$1[REMOVED COMMENT]");
System.out.println(stripped);

which will print:

[REMOVED COMMENT]

CREATE OR REPLACE PROCEDURE "MAIL_WITH_ATTACHMENT" ( ) 
IS    
tmp varchar(2) ; [REMOVED COMMENT]
tmp1 varchar(2) := 'some texxt'; [REMOVED COMMENT]
tmp2 varchar(3) := 'some more --text'; [REMOVED COMMENT]
tmp3 varchar(4) := 'this regex isn''t --working properly'; [REMOVED COMMENT]
BEGIN

          '--This is a Mime message, which your current mail reader may not' || crlf ||
          ' some more -- characters in a string';

    mesg:= crlf ||
          '--This is a Mime message, which your current mail reader may not' || crlf ||
      ' some more -- characters in a string';
END; 

EDIT

And if you only want to extract the comments, wrap the capture group around --.*$ and use a Pattern & Matcher to find() the matches:

// requires java.util.regex.Matcher and java.util.regex.Pattern
Matcher m = Pattern.compile("(?m)^(?:(?!--|').|'(?:''|[^'])*')*(--.*)$").matcher(sql);
while (m.find()) {
  System.out.println(m.group(1));
}

which will print:

-- this is a single line comment
-- this is a comment 
-- this is another comment
-- this is one more comment
-- Don't you think this is another comment

This should help if you read the input line by line:

   str = str.replaceAll("'{1}.*'{1}", "").replaceFirst(".*--", "--");

Input: -sd '--asdsa ---asdsadasdsad' || ' asdsad' || 'asdsadasd '--here x something

Output: --here x something
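
A small sketch of how that one-liner behaves on a few more lines, assuming it is wrapped in a helper (LineByLineDemo and extract are hypothetical names). Two things seem worth checking before relying on it: '{1}.*'{1} is greedy, so it removes everything from the first quote to the last quote on the line, and a line without a comment comes back unchanged rather than empty.

```java
class LineByLineDemo {
    // The one-liner from above, wrapped in a hypothetical helper method.
    static String extract(String line) {
        return line.replaceAll("'{1}.*'{1}", "").replaceFirst(".*--", "--");
    }

    public static void main(String[] args) {
        // The input from the answer: the strings are removed and the comment is kept.
        System.out.println(extract(
            "-sd '--asdsa ---asdsadasdsad' || ' asdsad' || 'asdsadasd '--here x something"));

        // An apostrophe inside the comment extends the greedy match into the comment,
        // so the "--" is removed together with the string.
        System.out.println(extract("tmp3 varchar(4) := 'x'; -- Don't you think"));

        // No comment on the line: the text is returned as-is, so the caller
        // still has to test whether the result starts with "--".
        System.out.println(extract("BEGIN"));
    }
}
```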

Edit: final version after three edits :)

This regex should work fine:

Pattern p = Pattern.compile("^[^']*('[^']*'[^']*)*(--.*)$");

except for case [5]. But before overcomplicating the regex: are you sure that Oracle doesn't complain about that string?

EDIT

This is the code I used to test the regex:

String[] text = {
    "-- this is a single line comment",
    "",
    "CREATE OR REPLACE PROCEDURE \"MAIL_WITH_ATTACHMENT\" ( ) ",
    "IS    ",
    "tmp varchar(2) ; -- this is a comment ",
    "tmp1 varchar(2) := 'some texxt'; -- this is another comment",
    "tmp2 varchar(3) := 'some more --text'; 'blah --blah' -- this is one more comment",
    "tmp3 varchar(4) := 'this regex isn't --working properly'; -- Don't you think this is another comment",
    "BEGIN",
    "",
    "          '--This is a Mime message, which your current mail reader may not' || crlf ||",
    "          ' some more -- characters in a string';",
    "",
    "    mesg:= crlf ||",
    "          '--This is a Mime message, which your current mail reader may not' || crlf ||",
    "      ' some more -- characters in a string';",
    "END; ",
};

Pattern p = Pattern.compile("^[^']*('[^']*'[^']*)*(--.*)$");
Matcher m = p.matcher("");

for (String s : text) {
  m.reset(s);
  if (m.find()) {
    System.out.println(m.group(m.groupCount()));
  }
}

And here's the output:

-- this is a single line comment
-- this is a comment 
-- this is another comment
-- this is one more comment
--working properly'; -- Don't you think this is another comment

As you can see, the last line of the output is "wrong". But, as you said, Oracle doesn't like such a string either. Once you correct isn't to isn''t, the output will be correct as well.
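
A quick way to check that claim, as a sketch: the same pattern run against the tmp3 line with and without the doubled quote. EscapedQuoteCheck and comment are hypothetical names introduced for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EscapedQuoteCheck {
    // Same pattern as in the test code above.
    static final Pattern P = Pattern.compile("^[^']*('[^']*'[^']*)*(--.*)$");

    // Returns the text captured by the last group, or null if the line does not match.
    static String comment(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group(m.groupCount()) : null;
    }

    public static void main(String[] args) {
        String broken = "tmp3 varchar(4) := 'this regex isn't --working properly';"
                      + " -- Don't you think this is another comment";
        String fixed  = "tmp3 varchar(4) := 'this regex isn''t --working properly';"
                      + " -- Don't you think this is another comment";

        System.out.println(comment(broken)); // starts at "--working properly';"
        System.out.println(comment(fixed));  // starts at "-- Don't you think"
    }
}
```

With the doubled quote, the line's quotes pair up again in front of the comment, which is what lets the pattern find the real start of the comment.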
