Regex to Retrieve Quoted String and Quote Character

I have a language that defines a string as being delimited by either single or double quotes, where the delimiter is escaped within the string by doubling it. For example, all of the following are legal strings:

'This isn''t easy to parse.'
'Then John said, "Hello Tim!"'
"This isn't easy to parse."
"Then John said, ""Hello Tim!"""

I have a collection of such strings, separated by text that doesn't contain a quote. Using regular expressions, I want to extract each string from the input. For example, here is an input:

"Some String #1" OR 'Some String #2' AND "Some 'String' #3" XOR
'Some "String" #4' HOWDY "Some ""String"" #5" FOO 'Some ''String'' #6'

The regular expression to determine whether a string is of such a form is trivial:

^(?:"(?:[^"]|"")*"|'(?:[^']|'')*')(?:\s+[^"'\s]+\s+(?:"(?:[^"]|"")*"|'(?:[^']|'')*'))*

After running the above expression to test whether it is of such a form, I need another regular expression to get each delimited string from the input. I plan to do this as follows:

Pattern pattern = Pattern.compile("What REGEX goes here?");
Matcher matcher = pattern.matcher(inputString);
int startIndex = 0;
while (matcher.find(startIndex))
{
    String quote        = matcher.group(1);
    String quotedString = matcher.group(2);
    ...
    startIndex = matcher.end();
}

I would like a regular expression that captures the quote character in group #1, and the text within quotes in group #2 (I am using Java Regex). So, for the above input, I am looking for a regular expression that produces the following output within each loop iteration:

Loop 1: matcher.group(1) = "
        matcher.group(2) = Some String #1
Loop 2: matcher.group(1) = '
        matcher.group(2) = Some String #2
Loop 3: matcher.group(1) = "
        matcher.group(2) = Some 'String' #3
Loop 4: matcher.group(1) = '
        matcher.group(2) = Some "String" #4
Loop 5: matcher.group(1) = "
        matcher.group(2) = Some ""String"" #5
Loop 6: matcher.group(1) = '
        matcher.group(2) = Some ''String'' #6

Patterns I have tried thus far (un-escaped, followed by escaped for Java code):

(["'])((?:[^\1]|\1\1)*)\1
"([\"'])((?:[^\\1]|\\1\\1)*)\\1"

(?<quot>")(?<val>(?:[^"]|"")*)"|(?<quot>')(?<val>(?:[^']|'')*)'
"(?<quot>\")(?<val>(?:[^\"]|\"\")*)\"|(?<quot>')(?<val>(?:[^']|'')*)'"

Both of these fail when trying to compile the pattern.

Is such a regular expression possible?

Make a utility class that matches for you:

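import java.util.regex.Matcher;
import java.util.regex.Pattern;

class test {
    // Group 1 is the quote character, group 2 the text between the quotes.
    private static final Pattern pd = Pattern.compile("(\")((?:[^\"]|\"\")*)\"");
    private static final Pattern ps = Pattern.compile("(')((?:[^']|'')*)'");

    // Expects s to be exactly one quoted string. The caller must still call
    // matches() on the result, since the single-quote matcher is returned unmatched.
    public static Matcher match(String s) {
        Matcher md = pd.matcher(s);
        if (md.matches()) return md;
        else return ps.matcher(s);
    }
}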

I'm not sure if this is what you're asking for, but you can just write some code to parse the string and get the desired results (quote character and inner text) instead of using a regular expression.

class Parser {

  public static ParseResult parse(String str)
  throws ParseException {

    if(str == null || (str.length() < 2)){
      throw new ParseException();
    }

    Character delimiter = getDelimiter(str);

    // Remove delimiters
    str = str.substring(1, str.length() -1);

    // Unescape doubled quotes in the inner string. replace() treats its
    // arguments literally; replaceAll() would interpret them as a regex.
    String escapedDelim = "" + delimiter + delimiter;
    str = str.replace(escapedDelim, "" + delimiter);

    return new ParseResult(delimiter, str);
  }

  private static Character getDelimiter(String str)
  throws ParseException {
    Character firstChar = str.charAt(0);
    Character lastChar = str.charAt(str.length() -1);

    if(!firstChar.equals(lastChar)){
      throw new ParseException(String.format(
            "First char (%s) doesn't match last char (%s) for string %s",
           firstChar, lastChar, str
      ));
    }

    return firstChar;
  }

}
class ParseResult {

  public final Character delimiter;
  public final String contents;

  public ParseResult(Character delimiter, String contents){
    this.delimiter = delimiter;
    this.contents = contents;
  }

}
class ParseException extends Exception {

  public ParseException(){
    super();
  }

  public ParseException(String msg){
    super(msg);
  }

}

Use this regex:

"^('|\")(.*)\\1$"

Some test code:

public static void main(String[] args) {
    String[] tests = {
            "'This isn''t easy to parse.'",
            "'Then John said, \"Hello Tim!\"'",
            "\"This isn't easy to parse.\"",
            "\"Then John said, \"\"Hello Tim!\"\"\""};
    Pattern pattern = Pattern.compile("^('|\")(.*)\\1$");
    Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find)
      .forEach(m -> System.out.println("1=" + m.group(1) + ", 2=" + m.group(2)));
}

Output:

1=', 2=This isn''t easy to parse.
1=', 2=Then John said, "Hello Tim!"
1=", 2=This isn't easy to parse.
1=", 2=Then John said, ""Hello Tim!""

If you're interested in how to capture quoted text nested inside the string:

This regex matches all variants and captures the quote in group 1 and the quoted text in group 6:

^((')|("))(.*?("\3|")(.*)\5)?.*\1$

Here's some test code:

public static void main(String[] args) {
    String[] tests = {
            "'This isn''t easy to parse.'",
            "'Then John said, \"Hello Tim!\"'",
            "\"This isn't easy to parse.\"",
            "\"Then John said, \"\"Hello Tim!\"\"\""};
    Pattern pattern = Pattern.compile("^((')|(\"))(.*?(\"\\3|\")(.*)\\5)?.*\\1$");
    Arrays.stream(tests).map(pattern::matcher).filter(Matcher::find)
      .forEach(m -> System.out.println("quote=" + m.group(1) + ", quoted=" + m.group(6)));
}

Output:

quote=', quoted=null
quote=', quoted=Hello Tim!
quote=", quoted=null
quote=", quoted=Hello Tim!
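
For the multi-string input in the question, a single pattern may also work if the character class is replaced by a negative lookahead on the backreference. The following is only a sketch, not a tested production pattern: the class name QuotedStringRegex and the helper extract are made up for illustration, and Pattern.DOTALL is an assumption that quoted strings may span lines. Group 1 holds the quote character and group 2 holds the raw contents, with doubled quotes left as they appear in the input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class QuotedStringRegex {
    // Group 1: the opening quote. Group 2: the raw contents.
    // (?!\1). consumes any character that is not the opening quote;
    // \1\1 consumes a doubled (escaped) quote.
    static final Pattern QUOTED =
        Pattern.compile("([\"'])((?:(?!\\1).|\\1\\1)*)\\1", Pattern.DOTALL);

    // Returns one {quote, contents} pair per quoted string found in the input.
    static List<String[]> extract(String input) {
        List<String[]> results = new ArrayList<>();
        Matcher matcher = QUOTED.matcher(input);
        while (matcher.find()) {
            results.add(new String[] { matcher.group(1), matcher.group(2) });
        }
        return results;
    }

    public static void main(String[] args) {
        String input = "\"Some String #1\" OR 'Some String #2' AND \"Some 'String' #3\" XOR\n"
            + "'Some \"String\" #4' HOWDY \"Some \"\"String\"\" #5\" FOO 'Some ''String'' #6'";
        for (String[] r : extract(input)) {
            System.out.println("group(1) = " + r[0] + ", group(2) = " + r[1]);
        }
    }
}
```

Because the two alternatives inside the loop can never match at the same position, the loop stops only at a lone closing quote, so a plain find() loop without a start index should be enough.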

Using regular expressions for this type of problem is very challenging. A simple parser that does not use regex is much easier to implement, understand, and maintain.

In addition, such a simple parser can easily support things like backslash escapes and the conversion of backslash sequences to characters (e.g. converting "\\n" to a newline character).
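
A minimal sketch of what such a hand-written parser could look like for the doubled-quote syntax in the question. The names QuotedStringScanner, Quoted and scan are hypothetical, and two choices here are assumptions rather than requirements from the question: doubled quotes are unescaped to a single quote, and an unterminated string throws IllegalArgumentException.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical scanner, named here only for illustration.
class QuotedStringScanner {
    // Result holder: the quote character and the unescaped contents.
    static final class Quoted {
        final char quote;
        final String contents;
        Quoted(char quote, String contents) {
            this.quote = quote;
            this.contents = contents;
        }
    }

    // Walks the input once and collects every quoted string it finds.
    static List<Quoted> scan(String input) {
        List<Quoted> results = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c != '"' && c != '\'') { i++; continue; } // skip separator text
            char quote = c;
            StringBuilder contents = new StringBuilder();
            boolean closed = false;
            i++; // move past the opening quote
            while (i < input.length()) {
                char d = input.charAt(i);
                if (d != quote) {
                    contents.append(d);
                    i++;
                } else if (i + 1 < input.length() && input.charAt(i + 1) == quote) {
                    contents.append(quote); // doubled quote becomes one literal quote
                    i += 2;
                } else {
                    closed = true; // lone quote closes the string
                    i++;
                    break;
                }
            }
            if (!closed) {
                throw new IllegalArgumentException("Unterminated string starting with " + quote);
            }
            results.add(new Quoted(quote, contents.toString()));
        }
        return results;
    }
}
```

Adding backslash escapes would mean one more branch in the inner loop, which is the kind of change that tends to be awkward to express in a regex.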

This can be done very easily with a simple regex, as shown below:

private static Object[] checkPattern(String name, String regex) {
    List<String> matchedString = new ArrayList<>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(name);
    while (matcher.find()) {
        if (matcher.group().length() > 0) {
            matchedString.add(matcher.group());
        }
    }
    return matchedString.toArray();
}


@Test
public void quotedtextMultipleQuotedLines() {
    String text = "He said, \"I am Tom\". She said, \"I am Lisa\".";
    String quoteRegex = "(\"[^\"]+\")";
    String[] strArray = {"\"I am Tom\"", "\"I am Lisa\""};
    assertArrayEquals(strArray, checkPattern(text, quoteRegex));
}

We get the strings as the array elements here.
