
Regex to capture groups and ignore last two characters where one is optional

I need to capture two groups from an input string. The values differ in structure as they come in.

The following are examples of the incoming strings:

Comment = "This is a comment";

NumericValue = 123456;

What I am trying to accomplish is to capture the string value from the left of the equals sign as one group and the value after the equals sign as a second group. The semicolon should never be included.

The caveat is that if the second group is a string, the quotes from each end must not be included in that capture group.

The expected results would be:

  1. Comment = "This is a comment";
    • key group => Comment
    • value group => This is a comment
  2. NumericValue = 123456;
    • key group => NumericValue
    • value group => 123456

The following is what I have so far. This works fine for capturing the numeric value, but leaves the end double quote when capturing the string value.

(?<key>\w+)\s*=\s*(?:[\"]?)(?<group>.+(?:(?=[\"]?;)))

EDIT

When applying the regex against a string value, it must allow capture of semicolons and double quotes within the string and ignore only the closing ones.

So, if we have an input of:

Comment = "This is a "comment"; This is still a comment";

The second capture group should be:

This is a "comment"; This is still a comment

An option is to use an alternation where you would have to check for group 2 or group 3:

(?<key>\w+)\h*=\h*(?:"(.*?)"|([^"\r\n]+));$
  • (?<key>\w+) Group key, match 1+ word chars
  • \h*=\h* Match an = between optional horizontal whitespace chars
  • (?: Non capturing group
    • "(.*?)" Capture group 2, match as few chars as possible between two "
    • | Or
    • ([^"\r\n]+) Capture group 3, match 1+ times any char except " or a newline
  • ); Close non capturing group and match ;
  • $ End of string


In Java

String regex = "(?<key>\\w+)\\h*=\\h*(?:\"(.*?)\"|([^\"\\r\\n]+));$";
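A minimal sketch of how the alternation might be consumed, assuming the value is taken from whichever of group 2 or group 3 took part in the match. The class name AlternationDemo and the helper parse are hypothetical names chosen for illustration; only the pattern itself comes from the answer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AlternationDemo {

    // Pattern from the answer: group 2 holds a quoted value, group 3 an unquoted one.
    private static final Pattern PAIR = Pattern.compile(
            "(?<key>\\w+)\\h*=\\h*(?:\"(.*?)\"|([^\"\\r\\n]+));$");

    /** Hypothetical helper: returns {key, value}, or null when the line does not match. */
    static String[] parse(String line) {
        Matcher m = PAIR.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Only one side of the alternation participates, so the other group is null.
        String value = m.group(2) != null ? m.group(2) : m.group(3);
        return new String[] { m.group("key"), value };
    }

    public static void main(String[] args) {
        String[] lines = {
            "Comment = \"This is a comment\";",
            "Comment = \"This is a \"comment\"; This is still a comment\";",
            "NumericValue = 123456;"
        };
        for (String line : lines) {
            String[] kv = parse(line);
            System.out.println(kv == null ? "No match" : kv[0] + " -> " + kv[1]);
        }
    }
}
```

Because the pattern requires the closing ; at the end of the string, the lazy (.*?) keeps expanding past inner quotes and semicolons until it reaches the final ";. A line without the trailing semicolon does not match this pattern.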

Edited based on comment to include ; and " in the comments as per the examples given:

(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>((")(?!;?$)|;(?!$)|[^;"])+)"?;?$
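As a rough sketch, the pattern above could be applied like this. The class name SingleGroupDemo and the helper valueOf are hypothetical, and the line is trimmed first on the assumption that surrounding whitespace should not affect the end-of-line anchor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SingleGroupDemo {

    // The edited pattern, written as a Java string literal (quotes escaped for Java).
    private static final Pattern PAIR = Pattern.compile(
            "(?<key>\\w+)\\s*=\\s*(?:[\"]?)(?<value>((\")(?!;?$)|;(?!$)|[^;\"])+)\"?;?$");

    /** Hypothetical helper: returns the captured value, or null when there is no match. */
    static String valueOf(String line) {
        Matcher m = PAIR.matcher(line.trim());
        return m.find() ? m.group("value") : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOf("Comment = \"This is a comment\";"));
        System.out.println(valueOf("Comment = \"This is a \"comment\"; This is still a comment\";"));
        System.out.println(valueOf("NumericValue = 123456;"));
    }
}
```

The negative lookaheads are what keep the closing characters out of the capture: a " is consumed only when it is not followed by an optional ; and the end of the line, and a ; is consumed only when it is not the last character.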

The following variant additionally rejects ; or " inside an unquoted (numeric) value. To add this, I had to use two differently named capturing groups, because Java does not allow the same group name to be used more than once in a pattern.

(?<key>\w+)\s*=\s*((?:")(?<valueT>((")(?!;?$)|;(?!$)|[^;"])+)";?$|(?<valueN>[^;"]+);?$)

Here is a class that tests it.

For readability, I have separated the key and value regexes in the class and put the test cases in a method of their own. Note that this still doesn't handle a numeric value containing ; or " (such lines produce no match). Also, each line needs to be trimmed before it is tested against the pattern (which I think is feasible).

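import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameValuePairRegex{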

    public static void main( String[] args ){
        String SPACE = "\\s*";
        String EQ = "=";
        String OR = "|";

        /* The original regex tried by you (for comparison). */
        String orig = "(?<key>\\w+)\\s*=\\s*(?:[\\\"]?)(?<value>.+(?:(?=;)))";

        String key = "(?<key>\\w+)";
        String valuePatternForText = "(?:\")(?<valueT>((\")(?!;?$)|;(?!$)|[^;\"])+)\";?$";
        String valuePatternForNumbers = "(?<valueN>[^;\"]+);?$";
        String p = key + SPACE + EQ + SPACE + "(" + valuePatternForText + OR + valuePatternForNumbers + ")";

        Pattern nvp = Pattern.compile( p );
        System.out.println( nvp.pattern() );
        print( input(), nvp );
    }

    private static void print( List<String> input, Pattern ep ) {
        for( String e : input ) {
            System.out.println( e );
            Matcher m = ep.matcher( e );
            boolean found = m.find();
            if( !found ) {
                System.out.println( "\t\tNo match" );
                continue;
            }

            String valueT = m.group( "valueT" );
            String valueN = m.group( "valueN" );

            System.out.print( "\t\t" + m.group( "key" ) + " -> " + ( valueT == null ? "" : valueT ) + " " + ( valueN == null ? "" : valueN ) );
            System.out.println(  );
        }

    }

    private static List<String> input(){
        List<String> neg = new ArrayList<>();
        Collections.addAll( neg, 
                "Comment = \"This is a comment\";",
                "Comment = \"This is a comment with semicolon ;\";", 
                "Comment = \"This is a comment with semicolon ; and quote\"\";",
                "Comment = \"This is a comment\"", 
                "Comment = \"This is a \"comment\"; This is still a comment\";",
                "NumericValue = 123456;",
                "NumericValue = 123;456;",
                "NumericValue = 123\"456;",
                "NumericValue = 123456" );

        return neg;
    }

}

Original answer:

The following modified regex fulfills the requirements you mentioned. I excluded ; and " from the value part.

Original that you tried:

(?<key>\w+)\s*=\s*(?:[\"]?)(?<group>.+(?:(?=[\"]?;)))

The changed one:

(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>[^;"]+)
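A small sketch of how this simpler pattern behaves, using the hypothetical class name SimpleExclusionDemo and helper valueOf. It covers the two basic inputs, but because ; and " are excluded from the value, a quoted value with an embedded quote or semicolon is cut off at that character.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleExclusionDemo {

    // The changed pattern: the value is any run of characters other than ; and ".
    private static final Pattern PAIR = Pattern.compile(
            "(?<key>\\w+)\\s*=\\s*(?:[\"]?)(?<value>[^;\"]+)");

    /** Hypothetical helper: returns the captured value, or null when there is no match. */
    static String valueOf(String line) {
        Matcher m = PAIR.matcher(line);
        return m.find() ? m.group("value") : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOf("Comment = \"This is a comment\";"));
        System.out.println(valueOf("NumericValue = 123456;"));
        // Embedded quote: the capture stops right before the first inner quote.
        System.out.println("[" + valueOf("Comment = \"This is a \"comment\"; This is still a comment\";") + "]");
    }
}
```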

Regular expressions are fun, but look how clean and easy to read this would be without using a regular expression:

int equals = s.indexOf('=');

String key = s.substring(0, equals).trim();

String value = s.substring(equals + 1).trim();
if (value.endsWith(";")) {
    value = value.substring(0, value.length() - 1).trim();
}
if (value.startsWith("\"") && value.endsWith("\"")) {
    value = value.substring(1, value.length() - 1);
}

Don't assume this is slower just because it uses more lines of code than a regular expression. The code a regex engine executes internally far exceeds the code above.
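The string-based steps above could be wrapped in a method along these lines. This is only a sketch: the class name KeyValueSplit and the method parse are hypothetical, and the two guards (a missing = and a value consisting of a single quote character) are additions that the snippet above does not have.

```java
public class KeyValueSplit {

    /** Hypothetical helper: returns {key, value}, or null when the line has no '='. */
    static String[] parse(String s) {
        int equals = s.indexOf('=');
        if (equals < 0) {
            return null; // guard added in this sketch; substring(0, -1) would throw
        }

        String key = s.substring(0, equals).trim();

        String value = s.substring(equals + 1).trim();
        if (value.endsWith(";")) {
            value = value.substring(0, value.length() - 1).trim();
        }
        // The length check (also added here) keeps a lone quote from being treated as a pair.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[] { key, value };
    }

    public static void main(String[] args) {
        String[] lines = {
            "Comment = \"This is a \"comment\"; This is still a comment\";",
            "NumericValue = 123456;"
        };
        for (String line : lines) {
            String[] kv = parse(line);
            System.out.println(kv[0] + " -> " + kv[1]);
        }
    }
}
```

Since only the last semicolon and the outermost pair of quotes are stripped, inner quotes and semicolons survive without any special handling.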
