Regular expression troubles, escaped quotes
Basically, I'm being passed a string and I need to tokenise it in much the same manner as command line options are tokenised by a *nix shell.
Say I have the following string:
"Hello\" World" "Hello Universe" Hi
How could I turn it into a 3-element list?
The following is my first attempt, but it's got a number of problems:
Code:
public void test() {
String str = "\"Hello\\\" World\" \"Hello Universe\" Hi";
List<String> list = split(str);
}
public static List<String> split(String str) {
Pattern pattern = Pattern.compile(
"\"[^\"]*\"" + /* double quoted token*/
"|'[^']*'" + /*single quoted token*/
"|[A-Za-z']+" /*everything else*/
);
List<String> opts = new ArrayList<String>();
Scanner scanner = new Scanner(str).useDelimiter(pattern);
String token;
while ((token = scanner.findInLine(pattern)) != null) {
opts.add(token);
}
return opts;
}
So the incorrect output of the following code is
EDIT: I'm totally open to a non-regex solution. It's just the first solution that came to mind.
I'm pretty sure you can't do this by just tokenising on a regex. If you need to deal with nested and escaped delimiters, you need to write a parser. See e.g. http://kore-nordmann.de/blog/do_NOT_parse_using_regexp.html
There will be open source parsers which can do what you want, although I don't know of any. You should also check out the StreamTokenizer class.
If you decide you want to forgo regex and do parsing instead, there are a couple of options. If you are willing to have just a double quote or a single quote (but not both) as your quote character, then you can use StreamTokenizer to solve this easily:
public static List<String> tokenize(String s) throws IOException {
List<String> opts = new ArrayList<String>();
StreamTokenizer st = new StreamTokenizer(new StringReader(s));
st.quoteChar('\"');
while (st.nextToken() != StreamTokenizer.TT_EOF) {
opts.add(st.sval);
}
return opts;
}
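One caveat with the snippet above: a freshly constructed StreamTokenizer also parses numbers (leaving sval null for them), treats '/' as a comment character, and already treats both '"' and '\'' as quote characters. The following is a minimal sketch, not part of the original answer, that resets the syntax table so that every non-whitespace character is a word character. The class name ShellLikeTokenizer is made up for illustration, and the IOException is wrapped only to keep the call site simple.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, named here for illustration only.
class ShellLikeTokenizer {
    static List<String> tokenize(String s) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(s));
        st.resetSyntax();            // drop the defaults: number parsing, '/' comments
        st.wordChars(33, 255);       // any visible character may be part of a word
        st.whitespaceChars(0, ' ');  // control characters and space separate tokens
        st.quoteChar('"');           // quoted strings; \" inside them is unescaped
        st.quoteChar('\'');

        List<String> tokens = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                tokens.add(st.sval); // sval is set for words and for quoted strings
            }
        } catch (IOException e) {
            // A StringReader should not fail, but the signature requires handling.
            throw new UncheckedIOException(e);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Hello" World, Hello Universe, Hi]
        System.out.println(tokenize("\"Hello\\\" World\" \"Hello Universe\" Hi"));
        // prints [-n, 42, /tmp/x]
        System.out.println(tokenize("-n 42 /tmp/x"));
    }
}
```

Note that StreamTokenizer ends a quoted string at a line break, so this sketch assumes single-line input.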
If you must support both quotes, here is a naive implementation that should work (caveat that a string like '"blah \\" blah"blah' will yield something like 'blah " blahblah'. If that isn't OK, you will need to make some changes):
public static List<String> splitSSV(String in) throws IOException {
ArrayList<String> out = new ArrayList<String>();
StringReader r = new StringReader(in);
StringBuilder b = new StringBuilder();
int inQuote = -1;
boolean escape = false;
int c;
// read each character
while ((c = r.read()) != -1) {
if (escape) { // if the previous char is escape, add the current char
b.append((char)c);
escape = false;
continue;
}
switch (c) {
case '\\': // deal with escape char
escape = true;
break;
case '\"':
case '\'': // deal with quote chars
if (inQuote == -1) { // not in a quote
inQuote = c; // now we are
} else if (inQuote == c) { // matching close quote
inQuote = -1; // we were in a quote and now we aren't
} else {
b.append((char)c); // the other quote char inside a quote is literal
}
break;
case ' ':
if (inQuote == -1) { // if we aren't in a quote, then add token to list
out.add(b.toString());
b.setLength(0);
} else {
b.append((char)c); // else append space to current token
}
break;
default:
b.append((char)c); // append all other chars to current token
}
}
if (b.length() > 0) {
out.add(b.toString()); // add final token to list
}
return out;
}
To recap, you want to split on whitespace, except when it is surrounded by double quotes that are not preceded by a backslash.
Step 1: tokenize the input: /([ \\t]+)|(\\\\")|(")|([^ \\t"]+)/
This gives you a sequence of SPACE, ESCAPED_QUOTE, QUOTE and TEXT tokens.
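The answer stops after the first step, so what follows is only one possible way to carry the idea through, sketched under my own assumptions. The class name TwoStepSplitter and the assembly loop are not from the original. The TEXT alternative is also adjusted to stop at a backslash, because otherwise a TEXT run would swallow the backslash of a following \" and the ESCAPED_QUOTE alternative would never match; a lone backslash is treated as ordinary text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class illustrating "tokenize first, then assemble".
class TwoStepSplitter {
    // Groups: 1 = SPACE, 2 = ESCAPED_QUOTE, 3 = QUOTE, 4 = TEXT.
    // Assumption: TEXT excludes backslashes; a single stray backslash is TEXT too.
    private static final Pattern TOKEN =
        Pattern.compile("([ \\t]+)|(\\\\\")|(\")|([^ \\t\"\\\\]+|\\\\)");

    static List<String> split(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean started = false; // true once the current argument has begun
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            if (m.group(1) != null) {            // SPACE
                if (inQuotes) {
                    current.append(m.group(1));  // whitespace is literal inside quotes
                } else if (started) {
                    out.add(current.toString()); // whitespace ends the argument
                    current.setLength(0);
                    started = false;
                }
            } else if (m.group(2) != null) {     // ESCAPED_QUOTE
                current.append('"');
                started = true;
            } else if (m.group(3) != null) {     // QUOTE toggles quoting
                inQuotes = !inQuotes;
                started = true;                  // so that "" yields an empty argument
            } else {                             // TEXT
                current.append(m.group(4));
                started = true;
            }
        }
        if (started) {
            out.add(current.toString());         // flush the last argument
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [Hello" World, Hello Universe, Hi]
        System.out.println(split("\"Hello\\\" World\" \"Hello Universe\" Hi"));
    }
}
```

This sketch does not report an unterminated quote; it simply treats the rest of the input as quoted.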
I think if you use a pattern like this:
Pattern pattern = Pattern.compile("\".*?(?<!\\\\)\"|'.*?(?<!\\\\)'|[A-Za-z']+");
Then it will give you the desired output. When I ran it with your input data I got this list:
["Hello\" World", "Hello Universe", Hi]
I used [A-Za-z']+ from your own question, but shouldn't it be just [A-Za-z]+ ?
Change your opts.add(token); line to:
opts.add(token.replaceAll("^\"|\"$|^'|'$", ""));
The first thing you need to do is stop thinking of the job in terms of split(). split() is meant for breaking down simple strings like this/that/the other, where / is always a delimiter. But you're trying to split on whitespace, unless the whitespace is within quotes, except if the quotes are escaped with backslashes (and if backslashes escape quotes, they probably escape other things, like other backslashes).
With all those exceptions-to-exceptions, it's just not possible to create a regex to match all possible delimiters, not even with fancy gimmicks like lookarounds, conditionals, reluctant and possessive quantifiers. What you want to do is match the tokens, not the delimiters.
In the following code, a token that's enclosed in double-quotes or single-quotes may contain whitespace as well as the quote character if it's preceded by a backslash. Everything except the enclosing quotes is captured in group #1 (for double-quoted tokens) or group #2 (single-quoted). Any character may be escaped with a backslash, even in non-quoted tokens; the "escaping" backslashes are removed in a separate step.
public static void test()
{
String str = "\"Hello\\\" World\" 'Hello Universe' Hi";
List<String> commands = parseCommands(str);
for (String s : commands)
{
System.out.println(s);
}
}
public static List<String> parseCommands(String s)
{
String rgx = "\"((?:[^\"\\\\]++|\\\\.)*+)\"" // double-quoted
+ "|'((?:[^'\\\\]++|\\\\.)*+)'" // single-quoted
+ "|\\S+"; // not quoted
Pattern p = Pattern.compile(rgx);
Matcher m = p.matcher(s);
List<String> commands = new ArrayList<String>();
while (m.find())
{
String cmd = m.start(1) != -1 ? m.group(1) // strip double-quotes
: m.start(2) != -1 ? m.group(2) // strip single-quotes
: m.group();
cmd = cmd.replaceAll("\\\\(.)", "$1"); // remove escape characters
commands.add(cmd);
}
return commands;
}
output:
Hello" World
Hello Universe
Hi
This is about as simple as it gets for a regex-based solution, and it doesn't really deal with malformed input, like unbalanced quotes. If you're not fluent in regexes, you might be better off with a purely hand-coded solution or, even better, a dedicated command-line interpreter (CLI) library.
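If rejecting malformed input matters, one option is to anchor each match at the current position and fail as soon as nothing matches there. The following is a rough sketch of that idea, not the answer's code: the class name StrictCommandParser and the method parseStrict are hypothetical, and the unquoted alternative is narrowed so that it can no longer swallow a stray quote the way \S+ does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical strict variant: throws on unbalanced quotes or a trailing backslash.
class StrictCommandParser {
    private static final Pattern TOKEN = Pattern.compile(
          "\"((?:[^\"\\\\]++|\\\\.)*+)\""     // group 1: double-quoted
        + "|'((?:[^'\\\\]++|\\\\.)*+)'"       // group 2: single-quoted
        + "|((?:[^\\s\"'\\\\]++|\\\\.)++)");  // group 3: unquoted, no bare quotes

    static List<String> parseStrict(String s) {
        List<String> commands = new ArrayList<>();
        Matcher m = TOKEN.matcher(s);
        int pos = 0;
        while (pos < s.length()) {
            if (Character.isWhitespace(s.charAt(pos))) {
                pos++;                        // skip separators between tokens
                continue;
            }
            m.region(pos, s.length());
            if (!m.lookingAt()) {             // no token starts exactly here
                throw new IllegalArgumentException("Malformed input at index " + pos);
            }
            String cmd = m.group(1) != null ? m.group(1)
                       : m.group(2) != null ? m.group(2)
                       : m.group(3);
            commands.add(cmd.replaceAll("\\\\(.)", "$1")); // remove escape characters
            pos = m.end();
        }
        return commands;
    }

    public static void main(String[] args) {
        // prints [Hello" World, Hello Universe, Hi]
        System.out.println(parseStrict("\"Hello\\\" World\" 'Hello Universe' Hi"));
        try {
            parseStrict("\"abc");             // opening quote is never closed
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage()); // prints Malformed input at index 0
        }
    }
}
```

One consequence of this design is that adjacent pieces such as foo"bar" come out as two separate tokens, which differs from what a real shell does.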