Regular expression troubles, escaped quotes

Basically, I'm being passed a string and I need to tokenise it in much the same way that a *nix shell tokenises command-line options.

Say I have the following string:

"Hello\" World" "Hello Universe" Hi

How could I turn it into a 3-element list?

  • Hello" World
  • Hello Universe
  • Hi

The following is my first attempt, but it has a number of problems:

  • It leaves the quote characters in the tokens
  • It doesn't catch the escaped quote

Code:

public void test() {
    String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
    List<String> list = split(str);
}

public static List<String> split(String str) {
    Pattern pattern = Pattern.compile(
        "\"[^\"]*\"" + /* double quoted token*/
        "|'[^']*'" + /*single quoted token*/
        "|[A-Za-z']+" /*everything else*/
    );

    List<String> opts = new ArrayList<String>();
    Scanner scanner = new Scanner(str).useDelimiter(pattern);

    String token;
    while ((token = scanner.findInLine(pattern)) != null) {
        opts.add(token);
    }
    return opts;
}

So the incorrect output of the above code is:

  • "Hello\"
  • World
  • " "
  • Hello
  • Universe
  • Hi

EDIT: I'm totally open to a non-regex solution. It's just the first solution that came to mind.

I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html

There are probably open source parsers that can do what you want, although I don't know of any. You should also check out the StreamTokenizer class.

If you decide to forgo regex and do the parsing yourself, there are a couple of options. If you are willing to accept just a double quote or just a single quote (but not both) as your quote character, then you can use StreamTokenizer to solve this easily:

public static List<String> tokenize(String s) throws IOException {
    List<String> opts = new ArrayList<String>();
    StreamTokenizer st = new StreamTokenizer(new StringReader(s));
    st.quoteChar('\"');
    while (st.nextToken() != StreamTokenizer.TT_EOF) {
        opts.add(st.sval);
    }

    return opts;
}
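The default StreamTokenizer syntax table also parses numbers and treats '/' as a comment character, which is awkward for command-line style input such as -n 5 /tmp. The following is a minimal sketch, not part of the original answer, that assumes every non-blank character outside quotes should belong to a word. It resets the syntax table and registers both quote characters; the class name StreamTokenizerSketch and the sample input are made up for illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class StreamTokenizerSketch {

    static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        st.resetSyntax();            // drop the defaults (number parsing, '/' comments)
        st.wordChars('!', 255);      // assumption: any visible char may be part of a word
        st.whitespaceChars(0, ' ');  // control chars and space separate tokens
        st.quoteChar('"');           // set after wordChars, which also covered the quotes
        st.quoteChar('\'');

        List<String> opts = new ArrayList<String>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                opts.add(st.sval);   // sval is set for words and for quoted strings
            }
        } catch (IOException e) {
            // a StringReader should not fail here; rethrow unchecked just in case
            throw new UncheckedIOException(e);
        }
        return opts;
    }

    public static void main(String[] args) {
        String input = "\"Hello\\\" World\" 'Hello Universe' Hi -n 5 /tmp";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

With this configuration a quoted string that directly touches a word (for example Hi"there") still comes back as two separate tokens, so it is not a full shell emulation.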

If you must support both quotes, here is a naive implementation that should work. One caveat: a string like '"blah \" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes:

public static List<String> splitSSV(String in) throws IOException {
    List<String> out = new ArrayList<String>();

    StringReader r = new StringReader(in);
    StringBuilder b = new StringBuilder();
    int inQuote = -1;  // the quote char that is currently open, or -1 if none
    boolean escape = false;
    int c;
    // read each character
    while ((c = r.read()) != -1) {
        if (escape) {  // the previous char was a backslash, so add this char literally
            b.append((char) c);
            escape = false;
            continue;
        }
        switch (c) {
        case '\\':  // escape char: take the next char literally
            escape = true;
            break;
        case '\"':
        case '\'':  // quote chars
            if (inQuote == -1) {
                inQuote = c;  // not in a quote, so now we are
            } else if (inQuote == c) {
                inQuote = -1;  // the matching quote char closes the quote
            } else {
                b.append((char) c);  // the other quote char is literal inside a quote
            }
            break;
        case ' ':
            if (inQuote != -1) {
                b.append((char) c);  // a space inside a quote belongs to the token
            } else if (b.length() > 0) {  // end of a token; runs of spaces add nothing
                out.add(b.toString());
                b.setLength(0);
            }
            break;
        default:
            b.append((char) c);  // append all other chars to current token
        }
    }
    if (b.length() > 0) {
        out.add(b.toString());  // add final token to list
    }
    return out;
}

To recap, you want to split on whitespace, except when the whitespace is surrounded by double quotes that are not themselves preceded by a backslash.

Step 1: tokenize the input: /([ \t]+)|(\\")|(")|([^ \t"]+)/

This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.

Step 2: build a finite state machine matching and reacting to the tokens:

State: START

  • SPACE -> return empty string
  • ESCAPED_QUOTE -> Error (?)
  • QUOTE -> State := WITHIN_QUOTES
  • TEXT -> return text

State: WITHIN_QUOTES

  • SPACE -> add value to accumulator
  • ESCAPED_QUOTE -> add quote to accumulator
  • QUOTE -> return and clear accumulator; State := START
  • TEXT -> add text to accumulator

Step 3: Profit!!

I think if you use a pattern like this:

Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

Then it will give you the desired output. When I ran it with your input data I got this list:

["Hello\" World", "Hello Universe", Hi]

I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+?

EDIT

Change your opts.add(token); line to:

opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
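Put together, the pattern and the quote-stripping line might look like the sketch below. It is only an illustration: the class name LookbehindSplitSketch is made up, a plain Matcher loop stands in for the Scanner from the question, and the last step (turning \" and \' back into bare quotes) is an addition that the answer does not include but that the expected output in the question seems to require.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LookbehindSplitSketch {

    // Pattern taken from the answer: a closing quote only counts
    // if it is not directly preceded by a backslash.
    private static final Pattern TOKEN =
        Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");

    static List<String> split(String str) {
        List<String> opts = new ArrayList<String>();
        Matcher m = TOKEN.matcher(str);
        while (m.find()) {
            // strip one enclosing quote from each end, as in the answer
            String token = m.group().replaceAll("^\"|\"$|^'|'$", "");
            // assumption: escaped quotes should lose their backslash
            token = token.replace("\\\"", "\"").replace("\\'", "'");
            opts.add(token);
        }
        return opts;
    }

    public static void main(String[] args) {
        for (String token : split("\"Hello\\\" World\" \"Hello Universe\" Hi")) {
            System.out.println(token);
        }
    }
}
```

One limitation of the lookbehind approach: it cannot tell an escaped quote from a quote that follows an escaped backslash, so a token ending in \\" is treated as still open.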

The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).

With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.

In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace, as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.

public static void test()
{
  String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
  List<String> commands = parseCommands(str);
  for (String s : commands)
  {
    System.out.println(s);
  }
}

public static List<String> parseCommands(String s)
{
  String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\""  // double-quoted
             + "|'((?:[^'\\\\]++|\\\\.)*+)'"    // single-quoted
             + "|\\S+";                         // not quoted
  Pattern p = Pattern.compile(rgx);
  Matcher m = p.matcher(s);
  List<String> commands = new ArrayList<String>();
  while (m.find())
  {
    String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
               : m.start(2) != -1 ? m.group(2) // strip single-quotes
               : m.group();
    cmd = cmd.replaceAll("\\\\(.)", "$1");  // remove escape characters
    commands.add(cmd);
  }
  return commands;
}

Output:

Hello" World
Hello Universe
Hi

This is about as simple as it gets for a regex-based solution, and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.
