
How to split a line with priority regex/pattern matching, Java

I know this question has probably been asked many times before, but I'm essentially trying to do the same thing as the JVM when it looks at run arguments on the command line, e.g.:

java MyProgram arg1 arg2 "argument three" arg4

The priority match is if the argument is in quotes, treat it as one argument; otherwise, separate them by spaces.

I'm reading through a CSV file, but sometimes one section is contained in quotes, so it might look something like this:

value, value, value, value, "value, value", value

Thus it adds one more element to the returned array from String.split().

The regex I'm trying to use:

String[] data = line.split("(\".*\")|,", -1);

So essentially I'm trying to say: if there's a double quote followed by anything, followed by another quote, treat that as the priority match (left to right); otherwise, split on the comma.

That regex doesn't seem to be working though, because I still get one more value on that line than there are fields (headers) in the file.

Any help would be appreciated, I'm not the best with regex. Thanks.
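To see why the split attempt in the question yields an extra element, here is a minimal sketch that runs the same pattern on the sample line. The class name NaiveSplitDemo and the method naiveSplit are hypothetical, chosen only for illustration. Because String.split() treats every match as a delimiter, the quoted section is discarded rather than kept, and the comma right after it still splits, leaving an empty piece.

```java
import java.util.Arrays;

final class NaiveSplitDemo {
    // Same pattern as in the question: a quoted run OR a comma acts as the delimiter.
    static String[] naiveSplit(String line) {
        return line.split("(\".*\")|,", -1);
    }

    public static void main(String[] args) {
        String line = "value, value, value, value, \"value, value\", value";
        String[] parts = naiveSplit(line);
        // The quoted field is consumed as a delimiter, so it never shows up in parts.
        System.out.println(parts.length);
        System.out.println(Arrays.toString(parts));
    }
}
```

For the six-field sample line this reports 7 pieces, none of which is the quoted field, which is why matching the fields (as the answers do) tends to work better here than splitting on delimiters.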

You can do the following. It matches tokens, treating spaces and commas as delimiters and ignoring delimiters inside quotes (same problem, different approach):

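import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

List<String> matchList = new ArrayList<>();
// Alternatives: an unquoted run (no whitespace, comma, or quote), a "double-quoted" section, or a 'single-quoted' section
Pattern regex = Pattern.compile("[^\\s,\\\"']+|\\\"([^\\\"]*)\\\"|'([^']*)'");
Matcher regexMatcher = regex.matcher(line);
while (regexMatcher.find()) {
    matchList.add(regexMatcher.group()); // group() keeps the surrounding quotes
}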

Edit: To allow only double quotes, you can use [^\\s,\\\"]+|\\\"([^\\\"]*)\\\" (as suggested by uraimo).

Output:

[value, value, value, value, "value, value", value]
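As the output shows, group() returns quoted tokens with their quotes still attached. A minimal sketch of a variant that returns the content of the capture groups instead is below. The class name QuoteAwareTokenizer and the method tokenize are hypothetical, and the pattern is the same alternation written with less escaping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class QuoteAwareTokenizer {
    // Unquoted run | "double-quoted" section (group 1) | 'single-quoted' section (group 2).
    private static final Pattern TOKEN =
            Pattern.compile("[^\\s,\"']+|\"([^\"]*)\"|'([^']*)'");

    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(line);
        while (m.find()) {
            if (m.group(1) != null) {
                tokens.add(m.group(1));      // content between double quotes
            } else if (m.group(2) != null) {
                tokens.add(m.group(2));      // content between single quotes
            } else {
                tokens.add(m.group());       // plain unquoted token
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("arg1 arg2 \"argument three\" arg4"));
        System.out.println(tokenize("value, value, \"value, value\", value"));
    }
}
```

Two properties of this approach are worth keeping in mind for CSV input: whitespace is a delimiter, so an unquoted field such as New York becomes two tokens, and empty fields (two commas in a row) produce no token at all. It fits the command-line style of input better than strict CSV.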

You are looking for either:

  • the start of the string or a comma (?:^|,) followed by zero or more whitespace characters \s* followed by a quote " then any number of non-quote characters ([^"]*) then another quote " then zero or more whitespace characters \s* and either a trailing comma or the end-of-line (?=,|$) - which when combined gives (?:^|,)\s*"([^"]*)"\s*(?=,|$) or
  • the start of the string or a comma (?:^|,) followed by zero or more non-comma characters ([^,]*) and either a trailing comma or the end-of-line (?=,|$) - which when combined gives (?:^|,)([^,]*)(?=,|$)

Putting the two together you get the regular expression:

(?:^|,)(?:\s*"([^"]*)"\s*|([^,]*))(?=,|$)

And you can implement it like this:

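import java.util.regex.Matcher;
import java.util.regex.Pattern;

String test = "value, value, value, value, \"value, value\", value";

// Group 1 captures a quoted field (without its quotes); group 2 captures an unquoted field.
Pattern pattern = Pattern.compile( "(?:^|,)(?:\\s*\"([^\"]*)\"\\s*|([^,]*))(?=,|$)" );
Matcher matcher = pattern.matcher( test );
while( matcher.find() ){
    String value = matcher.group(1);
    if ( value == null )
        value = matcher.group(2).trim(); // unquoted field: strip surrounding spaces
    System.out.println( value );
}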
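Since the original problem was a mismatch between the number of values and the number of headers, the loop above can be wrapped into a method that returns the fields as a list. This is a minimal sketch; the class name CsvLineParser, the method parseLine, and the header line used in main are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CsvLineParser {
    // Group 1: quoted field without its quotes; group 2: unquoted field.
    private static final Pattern FIELD =
            Pattern.compile("(?:^|,)(?:\\s*\"([^\"]*)\"\\s*|([^,]*))(?=,|$)");

    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            // Exactly one of the two groups participates in each match.
            fields.add(quoted != null ? quoted : m.group(2).trim());
        }
        return fields;
    }

    public static void main(String[] args) {
        List<String> header = parseLine("a, b, c, d, e, f"); // hypothetical header row
        List<String> row = parseLine("value, value, value, value, \"value, value\", value");
        System.out.println(row.size() == header.size());
        System.out.println(row);
    }
}
```

Empty fields in the middle or at the end of a line are kept as empty strings, so the field count stays aligned with the header. One caveat that appears worth testing on real data: a line that begins with an empty field (a leading comma) produces an empty match at position 0, after which Matcher.find() resumes one character later and the remaining fields may be missed.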

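If you want to expand it to allow backslash-escaped quotes inside the quoted string, then you want:

(?:^|,)(?:\s*"((?:[^"\\]|\\.)*)"\s*|([^,]*))(?=,|$)

Here each character of the quoted body is either something other than a quote or a backslash, or a backslash together with the character it escapes. Excluding the backslash from the first alternative matters: with (?:[^"]|\\")* an escaped quote directly before a comma, as in "a\",b", is mistaken for the closing quote. The pattern can be written, in Java, as:

Pattern pattern = Pattern.compile( "(?:^|,)(?:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"\\s*|([^,]*))(?=,|$)" );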
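A minimal sketch of escaped-quote handling follows, under two assumptions that are not in the original answer: the quoted body is matched as "a non-quote, non-backslash character, or a backslash plus the character after it", and \" is turned back into " after matching. The class name EscapedQuoteCsv and the method parseLine are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class EscapedQuoteCsv {
    // Quoted body: any char except quote/backslash, or a backslash plus the char it escapes.
    private static final Pattern FIELD = Pattern.compile(
            "(?:^|,)(?:\\s*\"((?:[^\"\\\\]|\\\\.)*)\"\\s*|([^,]*))(?=,|$)");

    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        Matcher m = FIELD.matcher(line);
        while (m.find()) {
            String quoted = m.group(1);
            if (quoted != null) {
                // Turn \" back into a plain quote; other escapes are left untouched.
                fields.add(quoted.replace("\\\"", "\""));
            } else {
                fields.add(m.group(2).trim());
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        // Raw line: a, "say \"hi\", then go", b
        String line = "a, \"say \\\"hi\\\", then go\", b";
        List<String> fields = parseLine(line);
        System.out.println(fields.size());
        System.out.println(fields.get(1));
    }
}
```

The sample line places an escaped quote directly before a comma, which is the case where consuming the backslash and the following character as a pair matters. Files that escape quotes by doubling them ("") instead of using a backslash would need a different pattern, and a dedicated CSV library is usually the safer choice for anything beyond simple input.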


 