Finding quoted strings with escaped quotes in C# using a regular expression

I'm trying to find all of the quoted text on a single line.

Example:

"Some Text"
"Some more Text"
"Even more text about \"this text\""

I need to get:

  • "Some Text"
  • "Some more Text"
  • "Even more text about \"this text\""

\"[^\"\r]*\" gives me everything except for the last one, because of the escaped quotes.

I have read about \"[^\"\\]*(?:\\.[^\"\\]*)*\" working, but I get an error at run time:

parsing ""[^"\]*(?:\.[^"\]*)*"" - Unterminated [] set.

How do I fix this?

What you've got there is an example of Friedl's "unrolled loop" technique, but you seem to have some confusion about how to express it as a string literal. Here's how it should look to the regex compiler:

"[^"\\]*(?:\\.[^"\\]*)*"

The initial "[^"\\]* matches a quotation mark followed by zero or more of any characters other than quotation marks or backslashes. That part alone, along with the final " , will match a simple quoted string with no embedded escape sequences, like "this" or "" .

If it does encounter a backslash, \\. consumes the backslash and whatever follows it, and [^"\\]* (again) consumes everything up to the next backslash or quotation mark. That part gets repeated as many times as necessary until an unescaped quotation mark turns up (or it reaches the end of the string and the match attempt fails).

Note that this will match "foo\"- in \"foo\"-"bar" . That may seem to expose a flaw in the regex, but it doesn't; it's the input that's invalid. The goal was to match quoted strings, optionally containing backslash-escaped quotes, embedded in other text--why would there be escaped quotes outside of quoted strings? If you really need to support that, you have a much more complex problem, requiring a very different approach.

As I said, the above is how the regex should look to the regex compiler. But you're writing it in the form of a string literal, and those tend to treat certain characters specially--i.e., backslashes and quotation marks. Fortunately, C#'s verbatim strings save you the hassle of having to double-escape backslashes; you just have to escape each quotation mark with another quotation mark:

Regex r = new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");

So the rule is double quotation marks for the C# compiler and double backslashes for the regex compiler--nice and easy. This particular regex may look a little awkward, with the three quotation marks at either end, but consider the alternative:

Regex r = new Regex("\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"");

In Java, you always have to write them that way. :-(
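
As a quick check of the verbatim-string form, here is a minimal sketch that runs the unrolled-loop pattern over the three sample strings from the question, placed on one line. It is written as C# 9 top-level statements; the variable names (quoted, line, found) are illustrative and not part of the original answer.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Unrolled-loop pattern as a verbatim string:
// each " is doubled for the C# compiler, each \ is doubled for the regex engine.
var quoted = new Regex(@"""[^""\\]*(?:\\.[^""\\]*)*""");

// The question's three samples on a single line.
// In a verbatim string, "" is one quotation mark and \ is a literal backslash.
string line = @"""Some Text"" ""Some more Text"" ""Even more text about \""this text\""""";

// Collect every match; Cast<Match>() keeps this compatible with older frameworks.
string[] found = quoted.Matches(line).Cast<Match>().Select(m => m.Value).ToArray();

foreach (string s in found)
    Console.WriteLine(s);
```

Each sample should come back as one match, including the last one with its escaped quotes intact.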

Regex for capturing strings (with \ for character escaping), for the .NET engine:

(?>(?(STR)(?(ESC).(?<-ESC>)|\\(?<ESC>))|(?!))|(?(STR)"(?<-STR>)|"(?<STR>))|(?(STR).|(?!)))+   

Here, a "friendly" version:

(?>                            | atomic group (no backtracking into it)
   (?(STR)                     | if (STRING MODE) then
         (?(ESC)               |     if (ESCAPE MODE) then
               .(?<-ESC>)      |          match any char and exit escape mode (pop ESC)
               |               |     else
               \\(?<ESC>)      |          match '\' and enter escape mode (push ESC)
         )                     |     endif
         |                     | else
         (?!)                  |     fail this alternative ((?!) never matches)
   )                           | endif
   |                           | -- OR
   (?(STR)                     | if (STRING MODE) then
         "(?<-STR>)            |     match '"' and exit string mode (pop STR)
         |                     | else
         "(?<STR>)             |     match '"' and enter string mode (push STR)
   )                           | endif
   |                           | -- OR
   (?(STR)                     | if (STRING MODE) then
         .                     |     match any character
         |                     | else
         (?!)                  |     fail this alternative ((?!) never matches)
   )                           | endif
)+                             | repeat for every character

Based on the examples at http://tomkaminski.com/conditional-constructs-net-regular-expressions. It relies on quote balancing. I have used it with great success. Use it with the Singleline flag.

To play around with regexes, I recommend Rad Software Regular Expression Designer, which has a nice "Language Elements" tab with quick access to some basic instructions. It's based on .NET's regex engine.
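
A minimal sketch of how the balancing-group pattern might be used from C#, assuming it is compiled unchanged with RegexOptions.Singleline as the answer suggests. The input string and the names stringRegex, text and first are made up for illustration; the code uses C# 9 top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// The balancing-group pattern from above as a verbatim string.
// Only the "" pairs are C# escaping (each stands for a single ").
var stringRegex = new Regex(
    @"(?>(?(STR)(?(ESC).(?<-ESC>)|\\(?<ESC>))|(?!))|(?(STR)""(?<-STR>)|""(?<STR>))|(?(STR).|(?!)))+",
    RegexOptions.Singleline);

// Hypothetical input: say "a\"b" end
string text = @"say ""a\""b"" end";

// Outside a string only a quote can start a match, so the match begins at the
// first " and stops once STR has been popped and the next character is not a quote.
Match first = stringRegex.Match(text);
Console.WriteLine(first.Value);
```

Because the pattern never checks that STR is empty at the end, an input with an unterminated quote would presumably still produce a match running to the end of the text, which is what "relies on quote balancing" implies.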

"(\\"|\\\\|[^"\\])*"

should work. Match either an escaped quote, an escaped backslash, or any other character except a quote or backslash character. Repeat.

In C#:

StringCollection resultList = new StringCollection();
Regex regexObj = new Regex(@"""(\\""|\\\\|[^""\\])*""");
Match matchResult = regexObj.Match(subjectString);
while (matchResult.Success) {
    resultList.Add(matchResult.Value);
    matchResult = matchResult.NextMatch();
} 

Edit: Added escaped backslash to the list to correctly handle "This is a test\\" .

Explanation:

First match a quote character.

Then the alternatives are evaluated from left to right. The engine first tries to match an escaped quote. If that doesn't match, it tries an escaped backslash. That way, it can distinguish between "Hello \" string continues" and "String ends here \\" .

If neither matches, then anything else is allowed except for a quote or backslash character. Then repeat.

Finally, match the closing quote.
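
The distinction described in the explanation can be sketched with the two example strings. This is an illustration only, using C# 9 top-level statements; the trailing text after the second string and the variable names are assumptions added to show where the match stops.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern as in the answer's C# snippet.
var regexObj = new Regex(@"""(\\""|\\\\|[^""\\])*""");

// Escaped quote in the middle: \" is consumed as a pair, so the string continues.
string continues = @"""Hello \"" string continues""";

// Escaped backslash at the end: \\ is consumed as a pair, so the next " closes the string.
string endsHere = @"""String ends here \\"" trailing text";

string m1 = regexObj.Match(continues).Value;
string m2 = regexObj.Match(endsHere).Value;

Console.WriteLine(m1);
Console.WriteLine(m2);
```

The first match should span the whole input, while the second should stop at the quote right after the two backslashes and leave the trailing text out.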

I recommend getting RegexBuddy. It lets you play around with it until you make sure everything in your test set matches.

As for your problem, I would try four backslashes instead of two:

\"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"

Well, Alan Moore's answer is good, but I would modify it a bit to make it more compact. For the regex compiler:

"([^"\\]*(\\.)*)*"

Compare with Alan Moore's expression:

"[^"\\]*(\\.[^"\\]*)*"

The explanation is very similar to Alan Moore's:

The first part " matches a quotation mark.

The second part [^"\\]* matches zero or more of any characters other than quotation marks or backslashes.

And the last part (\\.)* matches a backslash and whatever single character follows it. Note the *, which makes this group optional.

The parts described, along with the final " (i.e. "[^"\\]*(\\.)*" ), will match "Some Text" and "Even more Text\"", but will not match "Even more text about \"this text\"".

To make that possible, the part [^"\\]*(\\.)* needs to be repeated as many times as necessary until an unescaped quotation mark turns up (or it reaches the end of the string and the match attempt fails). So I wrapped that part in parentheses and added an asterisk. Now it matches "Some Text", "Even more Text\"", "Even more text about \"this text\"" and "Hello\\".

In C# code it will look like:

var r = new Regex("\"([^\"\\\\]*(\\\\.)*)*\"");

BTW, the order of the two main parts, [^"\\]* and (\\.)* , does not matter. You can write:

"([^"\\]*(\\.)*)*"

or

"((\\.)*[^"\\]*)*"

The result will be the same.
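
A small sketch comparing the two orderings on a few of the strings mentioned above. It uses C# 9 top-level statements; the sample line and the names classFirst, escapeFirst, a and b are illustrative.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Both orderings from the answer, written as verbatim strings.
var classFirst  = new Regex(@"""([^""\\]*(\\.)*)*""");
var escapeFirst = new Regex(@"""((\\.)*[^""\\]*)*""");

// Sample line: "Some Text" and "Even more Text\"" and "Hello\\"
string sample = @"""Some Text"" and ""Even more Text\"""" and ""Hello\\""";

string[] a = classFirst.Matches(sample).Cast<Match>().Select(m => m.Value).ToArray();
string[] b = escapeFirst.Matches(sample).Cast<Match>().Select(m => m.Value).ToArray();

// Print whether both patterns produced the same sequence of matches.
Console.WriteLine(a.SequenceEqual(b));
foreach (string s in a)
    Console.WriteLine(s);
```

This only compares the two forms on well-formed input where every quote is closed; it says nothing about their relative performance.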

Now we need to solve another problem: \"foo\"-"bar" . The current expression will match "foo\"-" , but we want it to match "bar" . I don't know "why would there be escaped quotes outside of quoted strings", but we can implement it easily by adding the following part to the beginning: (\G|[^\\]) . It says that we want the match to start at the point where the previous match ended or after any character except a backslash. Why do we need \G ? It covers cases such as "a""b" , where the opening quote of "b" directly follows the previous match, so there is no character left for [^\\] to consume.

Note that (\G|[^\\])"([^"\\]*(\\.)*)*" matches -"bar" in \"foo\"-"bar" . So, to get only "bar" , we need to specify the group and optionally give it a name, for example "MyGroup". Then the C# code will look like:

[TestMethod]
public void RegExTest()
{
    //Regex compiler: (?:\G|[^\\])(?<MyGroup>"(?:[^"\\]*(?:\\.)*)*")
    string pattern = "(?:\\G|[^\\\\])(?<MyGroup>\"(?:[^\"\\\\]*(?:\\\\.)*)*\")";
    var r = new Regex(pattern, RegexOptions.IgnoreCase);

    //Human readable form:       "Some Text"  and  "Even more Text\""     "Even more text about  \"this text\""      "Hello\\"      \"foo\"  - "bar"  "a"   "b" c "d"
    string inputWithQuotedText = "\"Some Text\" and \"Even more Text\\\"\" \"Even more text about \\\"this text\\\"\" \"Hello\\\\\" \\\"foo\\\"-\"bar\" \"a\"\"b\"c\"d\"";
    var quotedList = new List<string>();
    for (Match m = r.Match(inputWithQuotedText); m.Success; m = m.NextMatch())
        quotedList.Add(m.Groups["MyGroup"].Value);

    Assert.AreEqual(8, quotedList.Count);
    Assert.AreEqual("\"Some Text\"", quotedList[0]);
    Assert.AreEqual("\"Even more Text\\\"\"", quotedList[1]);
    Assert.AreEqual("\"Even more text about \\\"this text\\\"\"", quotedList[2]);
    Assert.AreEqual("\"Hello\\\\\"", quotedList[3]);
    Assert.AreEqual("\"bar\"", quotedList[4]);
    Assert.AreEqual("\"a\"", quotedList[5]);
    Assert.AreEqual("\"b\"", quotedList[6]);
    Assert.AreEqual("\"d\"", quotedList[7]);
}

The regular expression

(?<!\\)".*?(?<!\\)"

will also handle text that starts with an escaped quote:

\"Some Text\" Some Text "Some Text", and "Some more Text" an""d "Even more text about \"this text\""
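
Here is a minimal sketch that runs the lookbehind pattern over the sample line. The second input is an assumed edge case that is not part of the answer (a string ending in an escaped backslash); it is included only to show how the pattern behaves there. The code uses C# 9 top-level statements and illustrative variable names.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Lookbehind pattern from above: a quote counts only if the character before it is not a backslash.
var lookbehind = new Regex(@"(?<!\\)"".*?(?<!\\)""");

// The sample line from above, as a verbatim string.
string sampleLine = @"\""Some Text\"" Some Text ""Some Text"", and ""Some more Text"" an""""d ""Even more text about \""this text\""""";

string[] hits = lookbehind.Matches(sampleLine).Cast<Match>().Select(m => m.Value).ToArray();
foreach (string h in hits)
    Console.WriteLine(h);

// Assumed edge case (not from the answer): "Hello\\" x "y"
// The closing quote of "Hello\\" is preceded by a backslash, so the lookbehind skips it.
string edge = @"""Hello\\"" x ""y""";
string edgeMatch = lookbehind.Match(edge).Value;
Console.WriteLine(edgeMatch);
```

On the sample line the escaped quotes at the start are skipped and the empty "" inside an""d is reported as its own match. In the edge case the match runs past the intended closing quote, so this pattern seems best suited to input where a quoted string never ends in an escaped backslash.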

A simple answer, without the use of ? , is

"([^\\"]*(\\")*)*\"

or, as a verbatim string

@"^""([^\\""]*(\\"")*(\\[^""])*)*"""

It just means:

  • find the first "
  • find any number of characters that are not \ or "
  • find any number of escaped quotes \"
  • find any number of escaped characters that are not quotes
  • repeat the last three steps until you find "

I believe it works as well as @Alan Moore's answer, but, for me, it is easier to understand. It accepts unmatched ("unbalanced") quotes as well.

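I know this isn't the cleanest method, but for your example I would check the character before each " to see whether it is a \ , and if it is, I would ignore that quote.

Like RegexBuddy, which @Blankasaurus mentioned, RegexMagic also helps.

Any chance you need to do: \"[^\"\\\\]*(?:\\\\.[^\"\\\\]*)*\"

If you can define the start and end, the following should work: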

new Regex(@"^(""(.*)*"")$")
