I need to capture two groups from an input string. The values differ in structure as they come in.
The following are examples of the incoming strings:
Comment = "This is a comment";
NumericValue = 123456;
What I am trying to accomplish is to capture the string value from the left of the equals sign as one group and the value after the equals sign as a second group. The semicolon should never be included.
The caveat is that if the second group is a string, the quotes from each end must not be included in that capture group.
The expected results would be: key Comment with value This is a comment, and key NumericValue with value 123456.
The following is what I have so far. This works fine for capturing the numeric value, but leaves the end double quote when capturing the string value.
(?<key>\w+)\s*=\s*(?:[\"]?)(?<group>.+(?:(?=[\"]?;)))
EDIT
When applying the regex against a string value, it must allow capture of semicolons and double quotes within the string and ignore only the closing ones.
So, if we have an input of:
Comment = "This is a "comment"; This is still a comment";
The second capture group should be:
This is a "comment"; This is still a comment
An option is to use an alternation where you would have to check for group 2 or group 3:
(?<key>\w+)\h*=\h*(?:"(.*?)"|([^"\r\n]+));$
(?<key>\w+) - Group key, match 1+ word chars
\h*=\h* - Match an = between optional horizontal whitespace chars
(?: - Non capturing group
"(.*?)" - Capture in group 2 any char 0+ times (non-greedy) between "
| - Or
([^"\r\n]+) - Capture group 3, match 1+ times any char except " or a newline
); - Close non capturing group and match ;
$ - End of string
In Java
String regex = "(?<key>\\w+)\\h*=\\h*(?:\"(.*?)\"|([^\"\\r\\n]+));$";
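A minimal sketch of how that pattern might be used from Java. Since the value ends up in either group 2 (quoted) or group 3 (unquoted), the caller picks whichever one is non-null. The class name AlternationDemo and the helper parse are made up for illustration and are not part of the answer above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AlternationDemo {
    // Same pattern as above; \h (horizontal whitespace) needs Java 8 or later.
    static final Pattern P = Pattern.compile(
        "(?<key>\\w+)\\h*=\\h*(?:\"(.*?)\"|([^\"\\r\\n]+));$");

    // Hypothetical helper: returns {key, value}, or null when the line does not match.
    static String[] parse(String line) {
        Matcher m = P.matcher(line);
        if (!m.find()) {
            return null;
        }
        // Group 2 is set for quoted values, group 3 for unquoted ones.
        String value = m.group(2) != null ? m.group(2) : m.group(3);
        return new String[] { m.group("key"), value };
    }

    public static void main(String[] args) {
        String[] lines = {
            "Comment = \"This is a comment\";",
            "NumericValue = 123456;",
            "Comment = \"This is a \"comment\"; This is still a comment\";"
        };
        for (String line : lines) {
            String[] kv = parse(line);
            System.out.println(kv == null ? "No match" : kv[0] + " -> " + kv[1]);
        }
    }
}
```

Because the pattern ends with ;$ the lazy (.*?) keeps expanding until the closing quote is the one directly before the final semicolon, which is why inner quotes and semicolons survive in the captured value.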
Edited based on comment to include ; and " in the comments as per the examples given:
(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>((")(?!;?$)|;(?!$)|[^;"])+)"?;?$
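As a quick check of the pattern above, here is a small sketch that runs it against the inputs from the question. The class name SingleValueGroupDemo and the helper valueOf are hypothetical, and the trim() call is an assumption so that trailing whitespace does not get in the way of the $ anchor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SingleValueGroupDemo {
    // The pattern above as a Java string literal, split in two for readability.
    static final Pattern P = Pattern.compile(
        "(?<key>\\w+)\\s*=\\s*(?:[\"]?)"
        + "(?<value>((\")(?!;?$)|;(?!$)|[^;\"])+)\"?;?$");

    // Hypothetical helper: returns the named group "value", or null on no match.
    static String valueOf(String line) {
        Matcher m = P.matcher(line.trim());
        return m.find() ? m.group("value") : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOf("Comment = \"This is a comment\";"));
        System.out.println(valueOf("NumericValue = 123456;"));
        System.out.println(
            valueOf("Comment = \"This is a \"comment\"; This is still a comment\";"));
    }
}
```

The two negative lookaheads do the work here: a quote is consumed only when it is not the closing one, and a semicolon only when it is not the last character.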
The following one additionally doesn't allow ; or " to appear in the numeric text. However, to include this, I had to rename the capturing groups because the name cannot be used for more than one group.
(?<key>\w+)\s*=\s*((?:")(?<valueT>((")(?!;?$)|;(?!$)|[^;"])+)";?$|(?<valueN>[^;"]+);?$)
Here is a class that tests it.
For readability, I have separated the key and value regexes in the class. I have added the test cases in a method within the class. However, this still doesn't handle the case of a numeric text containing ; or ". Also, the line needs to be trimmed before being subjected to the pattern test (which I think is feasible).
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NameValuePairRegex {
    public static void main( String[] args ) {
        String SPACE = "\\s*";
        String EQ = "=";
        String OR = "|";
        /* The original regex tried by you (for comparison). */
        String orig = "(?<key>\\w+)\\s*=\\s*(?:[\\\"]?)(?<value>.+(?:(?=;)))";
        String key = "(?<key>\\w+)";
        /* Quoted text: inner quotes and semicolons are allowed, the closing ones are not captured. */
        String valuePatternForText = "(?:\")(?<valueT>((\")(?!;?$)|;(?!$)|[^;\"])+)\";?$";
        /* Unquoted (numeric) text: no quotes or semicolons allowed inside. */
        String valuePatternForNumbers = "(?<valueN>[^;\"]+);?$";
        String p = key + SPACE + EQ + SPACE + "(" + valuePatternForText + OR + valuePatternForNumbers + ")";
        Pattern nvp = Pattern.compile( p );
        System.out.println( nvp.pattern() );
        print( input(), nvp );
    }

    /* Prints each input line followed by its key and whichever value group matched. */
    private static void print( List<String> input, Pattern ep ) {
        for( String e : input ) {
            System.out.println( e );
            Matcher m = ep.matcher( e );
            boolean found = m.find();
            if( !found ) {
                System.out.println( "\t\tNo match" );
                continue;
            }
            /* Only one of the two value groups is non-null for a given match. */
            String valueT = m.group( "valueT" );
            String valueN = m.group( "valueN" );
            System.out.print( "\t\t" + m.group( "key" ) + " -> " + ( valueT == null ? "" : valueT ) + " " + ( valueN == null ? "" : valueN ) );
            System.out.println( );
        }
    }

    /* Test cases. */
    private static List<String> input() {
        List<String> neg = new ArrayList<>();
        Collections.addAll( neg,
            "Comment = \"This is a comment\";",
            "Comment = \"This is a comment with semicolon ;\";",
            "Comment = \"This is a comment with semicolon ; and quote\"\";",
            "Comment = \"This is a comment\"",
            "Comment = \"This is a \"comment\"; This is still a comment\";",
            "NumericValue = 123456;",
            "NumericValue = 123;456;",
            "NumericValue = 123\"456;",
            "NumericValue = 123456" );
        return neg;
    }
}
Original answer:
The following changed regex fulfills the requirements you mentioned. I added the exclusion of ; and " from the value part.
Original that you tried:
(?<key>\w+)\s*=\s*(?:[\"]?)(?<group>.+(?:(?=[\"]?;)))
The changed one:
(?<key>\w+)\s*=\s*(?:[\"]?)(?<value>[^;"]+)
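A short sketch of what the changed regex captures, including why it stops short on the input from the EDIT in the question. The class name FirstAttemptDemo and the helper valueOf are invented for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FirstAttemptDemo {
    // The changed regex from above as a Java string literal.
    static final Pattern P = Pattern.compile(
        "(?<key>\\w+)\\s*=\\s*(?:[\"]?)(?<value>[^;\"]+)");

    // Hypothetical helper: returns the named group "value", or null on no match.
    static String valueOf(String line) {
        Matcher m = P.matcher(line);
        return m.find() ? m.group("value") : null;
    }

    public static void main(String[] args) {
        System.out.println(valueOf("Comment = \"This is a comment\";"));
        System.out.println(valueOf("NumericValue = 123456;"));
        // [^;"]+ stops at the first inner quote, so the value is cut short here.
        System.out.println(
            valueOf("Comment = \"This is a \"comment\"; This is still a comment\";"));
    }
}
```

Excluding ; and " from the value works for the two simple inputs, but for a string with embedded quotes the capture ends at the first inner quote, which is what the edited regexes above address.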
Regular expressions are fun, but look how clean and easy to read this would be without using a regular expression:
int equals = s.indexOf('=');
String key = s.substring(0, equals).trim();
String value = s.substring(equals + 1).trim();
if (value.endsWith(";")) {
    value = value.substring(0, value.length() - 1).trim();
}
if (value.startsWith("\"") && value.endsWith("\"")) {
    value = value.substring(1, value.length() - 1);
}
Don't assume this is slower just because it takes more lines of code than a regular expression. A regex engine executes far more code internally than the snippet above.
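For completeness, the snippet above could be wrapped in a method along these lines. The class name KeyValueSplitter, the method parse, the exception for a missing = and the length check before stripping the quotes are all additions for this sketch, not part of the original snippet.

```java
class KeyValueSplitter {
    // Hypothetical wrapper around the snippet above; returns {key, value}.
    static String[] parse(String s) {
        int equals = s.indexOf('=');
        if (equals < 0) {
            // Assumption: a line without '=' is treated as an error.
            throw new IllegalArgumentException("No '=' found in: " + s);
        }
        String key = s.substring(0, equals).trim();
        String value = s.substring(equals + 1).trim();
        // Drop the trailing semicolon, if any.
        if (value.endsWith(";")) {
            value = value.substring(0, value.length() - 1).trim();
        }
        // Strip one pair of surrounding quotes; the length check avoids
        // treating a value that is a single quote character as a pair.
        if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
            value = value.substring(1, value.length() - 1);
        }
        return new String[] { key, value };
    }

    public static void main(String[] args) {
        String[] kv = parse("Comment = \"This is a \"comment\"; This is still a comment\";");
        System.out.println(kv[0] + " -> " + kv[1]);
    }
}
```

Only the last semicolon and the outermost pair of quotes are removed, so quotes and semicolons inside the string are kept as the question requires.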