Regex to parse string with escaped characters

I am reading information out of a formatted string. The format looks like this:

"foo:bar:beer:123::lol"

Everything between the ":" characters is data I want to extract with a regex. If a : is immediately followed by another : (as in "::"), the data for that field has to be "" (an empty string).

Currently I am parsing it with this regex:

(.*?)(:|$)

Now it occurred to me that ":" may also appear within the data, so it has to be escaped. Example:

"foo:bar:beer:\::1337"

How can I change my regular expression so that it matches the "\:" as data, too?

Edit: I am using JavaScript as the programming language. It seems to have some limitations regarding complex regular expressions. The solution should work in JavaScript as well.

Thanks, McFarlane

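var myregexp = /((?:\\.|[^\\:])*)(?::|$)/g;
var matches = [];
var match = myregexp.exec(subject);
while (match != null) {
    // match[1] is the captured field (escape sequences are left intact)
    matches.push(match[1]);
    // A zero-length match at the end of the string does not advance
    // lastIndex, so step past it to avoid looping forever
    if (match[0].length === 0) {
        myregexp.lastIndex++;
    }
    match = myregexp.exec(subject);
}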

Input: "foo:bar:beer:\\:::1337" (written as a JavaScript string literal, so \\ is a single backslash)

Output: ["foo", "bar", "beer", "\\:", "", "1337", ""]

You'll always get an empty string as the last match. This is unavoidable given the requirement that empty strings must also match between delimiters (and the lack of lookbehind assertions in JavaScript before ES2018).

Explanation:

(          # Match and capture:
 (?:       # Either match...
  \\.      # an escaped character
 |         # or
  [^\\:]   # any character except backslash or colon
 )*        # zero or more times
)          # End of capturing group
(?::|$)    # Match (but don't capture) a colon or end-of-string
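As a rough sketch, the pattern explained above can be wrapped in a small helper. The name `splitEscaped` is made up for illustration and is not part of the answer. It assumes well-formed input (no lone backslash at the very end), and it lets the loop decide when to stop: once a match ends at end-of-string instead of at a colon, it breaks, so the extra trailing empty match is not collected.

```javascript
// Hypothetical helper: splits on unescaped ":" and leaves escape sequences intact.
function splitEscaped(subject) {
  var fieldPattern = /((?:\\.|[^\\:])*)(?::|$)/g;
  var fields = [];
  var match;
  while ((match = fieldPattern.exec(subject)) !== null) {
    fields.push(match[1]);
    // If no ":" was consumed, the match ended at end-of-string: stop here.
    // This also avoids looping forever on a zero-length match.
    if (match[0].length === match[1].length) {
      break;
    }
  }
  return fields;
}

console.log(splitEscaped("foo:bar:beer:\\:::1337"));
// → ["foo", "bar", "beer", "\\:", "", "1337"]
```

A string that ends with a colon, such as "a:", still yields a final empty field, because that empty string is real data rather than a leftover match.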

Use a negative lookbehind assertion.

(.*?)((?<!\\):|$)

This matches : only if it is not preceded by \ . Note that JavaScript supports lookbehind assertions only from ES2018 onward.
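A minimal sketch of the lookbehind idea, assuming an engine with ES2018 lookbehind support (for example a recent Node.js). It uses the same assertion with `String.prototype.split` rather than the exact pattern above, and the function name `splitOnUnescapedColon` is hypothetical. It also shows a limitation of this approach: a colon that follows an escaped backslash is still treated as escaped.

```javascript
// Hypothetical helper: split on every ":" that is not directly preceded by "\".
// Requires ES2018 lookbehind support.
function splitOnUnescapedColon(subject) {
  return subject.split(/(?<!\\):/);
}

// The characters foo:bar:beer:\::1337
console.log(splitOnUnescapedColon("foo:bar:beer:\\::1337"));
// → ["foo", "bar", "beer", "\\:", "1337"]

// Limitation: in a\\:b the backslash is itself escaped, so the colon should
// act as a delimiter, but the lookbehind only sees the preceding "\".
console.log(splitOnUnescapedColon("a\\\\:b"));
// → ["a\\\\:b"]
```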

Here's a solution:

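function tokenize(str) {
  // A token is a run of escaped characters (\x) or characters other than \ and :
  var reg = /((\\.|[^\\:])*)/g;
  var array = [];
  while (reg.lastIndex < str.length) {
    var match = reg.exec(str);
    // Unescape \\ and \: ; any other backslash sequence is left as-is
    array.push(match[0].replace(/\\(\\|:)/g, "$1"));
    // Skip the ":" delimiter (this also steps past an empty token)
    reg.lastIndex++;
  }
  return array;
}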

It splits a string into tokens on the : character.

  • You can escape the : character with \ if you want it to be part of a token.
  • You can escape the \ with \ if you want it to be part of a token.
  • Any other \ won't be interpreted (i.e. \a remains \a ).
  • So you can put any data in your tokens, provided that the data is correctly escaped beforehand.
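The last point assumes the data was escaped before the fields were joined. As a sketch of that encoding side (the helpers `escapeToken` and `joinTokens` are hypothetical and not part of the answer), each backslash and colon is prefixed with a backslash, which is the form the tokenizer rules above expect:

```javascript
// Hypothetical helpers for producing a string the tokenizer can read back.
function escapeToken(token) {
  // Prefix every backslash and colon with a backslash ("$&" is the matched text)
  return token.replace(/[\\:]/g, "\\$&");
}

function joinTokens(tokens) {
  // Escape each token, then join with the ":" delimiter
  return tokens.map(escapeToken).join(":");
}

// The tokens a:b, <empty string> and c\d become the characters a\:b::c\\d
console.log(joinTokens(["a:b", "", "c\\d"]));
// → "a\\:b::c\\\\d" when written as a JavaScript string literal
```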

Here is an example with the string \a:b:\n::\\:\::x , which should give these tokens: \a , b , \n , <empty string> , \ , : , x .

>>> tokenize("\\a:b:\\n::\\\\:\\::x");
["\a", "b", "\n", "", "\", ":", "x"]

To be clearer: the string passed to the tokenizer is interpreted, and it has 2 special characters: \ and :

  • \ has a special meaning only if it is followed by \ or : , and it effectively "escapes" these characters: they lose their special meaning for the tokenizer and are treated as normal characters (and thus become part of tokens).
  • : is the marker separating 2 tokens.

I realize the OP didn't ask for backslash escaping, but other readers may need a complete parsing library that allows any character in the data.
