
Replacing spaces within quotes

I'm really struggling with regex here. Using Java, how would I go about replacing all spaces within quotes (double quotes, really) with another character (or an escaped space, "\ "), but ONLY if the phrase ends with a wildcard character?

word1 AND "word2 word3 word4*" OR "word5 word6" OR word7

to

word1 AND "word2\ word3\ word4*" OR "word5 word6" OR word7

Do you really need regular expressions here? The task seems well-described, but a little too complex for regular expressions. So I would rather program it out explicitly.

package so4478038;

import static org.junit.Assert.*;

import org.junit.Test;

public class QuoteSpaces {

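  public static String escapeSpacesInQuotes(String input) {
    StringBuilder sb = new StringBuilder();
    // Buffers the current quoted phrase, including its opening quote.
    StringBuilder quotedWord = new StringBuilder();
    boolean inQuotes = false;
    for (int i = 0, imax = input.length(); i < imax; i++) {
      char c = input.charAt(i);
      if (c == '"') {
        if (!inQuotes) {
          quotedWord.setLength(0);
        } else {
          // Closing quote: flush the buffered phrase, escaping its spaces
          // only if it ends with the wildcard.
          String qw = quotedWord.toString();
          if (qw.endsWith("*")) {
            sb.append(qw.replace(" ", "\\ "));
          } else {
            sb.append(qw);
          }
        }
        inQuotes = !inQuotes;
      }
      // The opening quote is buffered; the closing quote goes straight to sb.
      if (inQuotes) {
        quotedWord.append(c);
      } else {
        sb.append(c);
      }
    }
    // An unterminated quote would otherwise be dropped; emit it unchanged.
    if (inQuotes) {
      sb.append(quotedWord);
    }
    return sb.toString();
  }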

  @Test
  public void test() {
    assertEquals("word1 AND \"word2\\ word3\\ word4*\" OR \"word5 word6\" OR word7", escapeSpacesInQuotes("word1 AND \"word2 word3 word4*\" OR \"word5 word6\" OR word7"));
  }
}

I think the best solution is to use a regular expression to find the quoted strings you want, and then to replace the spaces within the regex's match. Something like this:

import java.util.regex.*;

class SOReplaceSpacesInQuotes {
  public static void main(String[] args) {
    Pattern findQuotes = Pattern.compile("\"[^\"]+\\*\"");

    for (String arg : args) {
      Matcher m = findQuotes.matcher(arg);

      StringBuffer result = new StringBuffer();
      while (m.find())
        m.appendReplacement(result, m.group().replace(" ", "\\\\ "));
      m.appendTail(result);

      System.out.println(arg + " -> " + result.toString());
    }
  }
}

Running java SOReplaceSpacesInQuotes 'word1 AND "word2 word3 word4*" OR "word5 word6*" OR word7' then produces the output word1 AND "word2 word3 word4*" OR "word5 word6*" OR word7 -> word1 AND "word2\ word3\ word4*" OR "word5\ word6*" OR word7 , which is exactly what you wanted.

The pattern is "[^"]+\*" , but backslashes and quotes have to be escaped for Java. This matches a literal quote, one or more non-quotes, a * , and a quote, which is what you want. This assumes that (a) you aren't allowed to have embedded \" escape sequences, and (b) that * is the only wildcard. If you have embedded escape sequences, then use "([^\\"]|\\.)+\*" (which, escaped for Java, is \"([^\\\\\"]|\\\\.)+\\*\" ); if you have multiple wildcards, use "[^"]+[*+]" ; and if you have both, combine them in the obvious way. Dealing with multiple wildcards is a matter of just letting any of them match at the end of the string. Dealing with escape sequences is done by matching a quote followed by one or more items, each of which is either a non-backslash, non-quote character or a backslash followed by any character at all.
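As a sketch of "combining them in the obvious way", the pattern below handles both embedded escape sequences and the two wildcards * and +. The class and method names are made up for illustration, and Matcher.quoteReplacement is used here instead of hand-doubled backslashes. Like the simpler pattern, it assumes the quotes in the input are properly paired.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WildcardQuotePatterns {
  // Regex: "([^\\"]|\\.)+[*+]"
  // A quote, one or more items (a non-backslash, non-quote character, or a
  // backslash followed by any character), a wildcard (* or +), and a quote.
  static final Pattern ESCAPE_AWARE =
      Pattern.compile("\"([^\\\\\"]|\\\\.)+[*+]\"");

  // Escapes the spaces inside every quoted phrase that ends in a wildcard.
  static String escapeSpaces(String input) {
    Matcher m = ESCAPE_AWARE.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      // "\\ " is a single backslash plus a space; quoteReplacement keeps
      // appendReplacement from interpreting backslashes or $ in the result.
      String replaced = m.group().replace(" ", "\\ ");
      m.appendReplacement(out, Matcher.quoteReplacement(replaced));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```

Calling WildcardQuotePatterns.escapeSpaces on a phrase such as "x y+" escapes the space, while "x y" is left alone.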

Now, that pattern finds the quoted strings you want. For each argument to the program, we then match all of them, and using m.group().replace(" ", "\\\\ ") , replace each space in what was matched (the quoted string) with a backslash and a space. (The string itself is \\ followed by a space. Two real backslashes are required because appendReplacement treats a backslash in its replacement string as an escape character, just as it treats $ as a group reference; Matcher.quoteReplacement can do this escaping for you.) If you haven't seen appendReplacement and appendTail before (I hadn't), here's what they do: in tandem, they iterate through the entire string, replacing whatever was matched with the second argument to appendReplacement , and appending it all to the given StringBuffer . The appendTail call is necessary to catch whatever didn't match at the end. The documentation for Matcher.appendReplacement(StringBuffer,String) contains a good example of their use.
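A small sketch of the backslash handling in isolation: appendReplacement treats a backslash in its replacement argument as an escape character, so a literal backslash has to be written twice, or the string can be passed through Matcher.quoteReplacement. The class and method names are hypothetical; both methods are meant to produce the same result.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReplacementEscapingDemo {
  private static final Pattern SPACE = Pattern.compile(" ");

  // The Java literal "\\\\ " is the three characters \, \ and space;
  // appendReplacement then reads the two backslashes as one literal backslash.
  static String viaDoubledBackslash(String input) {
    Matcher m = SPACE.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      m.appendReplacement(out, "\\\\ ");
    }
    m.appendTail(out);
    return out.toString();
  }

  // The Java literal "\\ " is one backslash and a space;
  // quoteReplacement adds the extra escaping that appendReplacement expects.
  static String viaQuoteReplacement(String input) {
    Matcher m = SPACE.matcher(input);
    StringBuffer out = new StringBuffer();
    while (m.find()) {
      m.appendReplacement(out, Matcher.quoteReplacement("\\ "));
    }
    m.appendTail(out);
    return out.toString();
  }
}
```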


Edit: As Roland Illig pointed out, this is problematic if certain kinds of invalid input can appear, such as a AND "b" AND *"c" , which would become a AND "b"\ AND\ *"c" . If this is a danger (or if it could possibly become a danger in the future, which it likely could), then you should make it more robust by always matching quotes, but only replacing if they end in a wildcard character. This will work as long as your quotes are always appropriately paired, which is a much weaker assumption. The resulting code is very similar:

import java.util.regex.*;

class SOReplaceSpacesInQuotes {
  public static void main(String[] args) {
    Pattern findQuotes = Pattern.compile("\"[^\"]+?(\\*)?\"");

    for (String arg : args) {
      Matcher m = findQuotes.matcher(arg);

      StringBuffer result = new StringBuffer();
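      while (m.find()) {
        // quoteReplacement keeps a \ or $ inside the match from being
        // interpreted by appendReplacement.
        if (m.group(1) == null) {
          m.appendReplacement(result, Matcher.quoteReplacement(m.group()));
        } else {
          m.appendReplacement(result, Matcher.quoteReplacement(m.group().replace(" ", "\\ ")));
        }
      }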
      m.appendTail(result);

      System.out.println(arg + " -> " + result.toString());
    }
  }
}

We put the wildcard character in a group and make it optional, and we make the body of the quotes reluctant with +? , so that it will match as little as possible and let the wildcard character get grouped. This way, we match each successive pair of quotes, and since the regex engine won't restart in the middle of a match, we'll only ever match the insides, not the outsides, of quotes. But now we don't always want to replace the spaces; we only want to do so if there was a wildcard character. This is easy: test to see if group 1 is null . If it is, then there wasn't a wildcard character, so replace the string with itself. Otherwise, replace the spaces. And indeed, java SOReplaceSpacesInQuotes 'a AND "b d" AND *"c d"' yields the desired a AND "b d" AND *"c d" -> a AND "b d" AND *"c d" , while java SOReplaceSpacesInQuotes 'a AND "b d" AND "c d*"' performs a substitution to get a AND "b d" AND "c d*" -> a AND "b d" AND "c\ d*" .
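For reuse outside of main, the same idea can be wrapped in a method. This is only a sketch: it assumes Java 9 or later, where Matcher.replaceAll accepts a function, and the class and method names are invented for illustration. The pattern is the one from the code above.

```java
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuotedWildcardEscaper {
  // Matches every quoted phrase; group 1 is set only when it ends in *.
  private static final Pattern QUOTED = Pattern.compile("\"[^\"]+?(\\*)?\"");

  static String escape(String input) {
    return QUOTED.matcher(input).replaceAll((MatchResult r) -> {
      String quoted = r.group();
      if (r.group(1) != null) {
        // Wildcard present: escape each space with a single backslash.
        quoted = quoted.replace(" ", "\\ ");
      }
      // The returned string is processed like an appendReplacement
      // argument, so backslashes and $ have to be quoted.
      return Matcher.quoteReplacement(quoted);
    });
  }
}
```

With this, QuotedWildcardEscaper.escape leaves a AND "b d" AND *"c d" unchanged and turns "c d*" into "c\ d*".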

Does this work?

str.replaceAll("\"", "\\");

I don't have an IDE at hand right now, so I haven't tested it.
