
How can I handle multiple parentheses in a regex?

I have strings of this type:

text (more text)

What I would like is a regular expression that extracts the "more text" segment of the string. So far I have been using this regular expression:

"^.*\\((.*)\\)$"

Although it works in many cases, it seems to fail if I have something like this:

text (more text (even more text))

What I get is: even more text)

What I would like to get instead is: more text (even more text) (basically, the content of the outermost pair of parentheses).

Besides lazy quantification, another way is:

"^[^(]*\\((.*)\\)$"

In both regexes, there is an explicitly specified left parenthesis ( "\\(" , with Java String escaping) immediately before the capturing group. In the original, there was a .* before that, allowing anything (including other left parentheses). In mine, left parentheses are not allowed there (there is a negated character class), so the explicitly specified left parenthesis is the outermost one.

Try:
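As a quick check of the negated character class, here is a minimal Java sketch. The class and method names are made up for illustration; only the java.util.regex calls are real API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not part of any library.
class NegatedClassDemo {
    // [^(]* cannot consume a '(', so the literal \( that follows
    // has to match the first left parenthesis in the input.
    static final Pattern OUTERMOST = Pattern.compile("^[^(]*\\((.*)\\)$");

    // Returns capture group 1, or null when the input does not match.
    static String extract(String input) {
        Matcher m = OUTERMOST.matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("text (more text)"));
        System.out.println(extract("text (more text (even more text))"));
    }
}
```

For the two sample strings from the question this should print more text and more text (even more text).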

"^.*?\\((.*)\\)$"

That makes the first .* lazy instead of greedy. Greedy means it swallows everything it possibly can while still getting an overall pattern match.

The other suggestion:
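To compare the greedy and lazy forms side by side, a small Java sketch follows. The helper name group1 is an assumption made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for comparing the two quantifiers.
class LazyVsGreedy {
    // Compiles the given regex and returns capture group 1,
    // or null when the whole input does not match.
    static String group1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "text (more text (even more text))";
        // Greedy prefix: stops at the last '(' in the input.
        System.out.println(group1("^.*\\((.*)\\)$", input));
        // Lazy prefix: stops at the first '(' in the input.
        System.out.println(group1("^.*?\\((.*)\\)$", input));
    }
}
```

The greedy pattern should yield even more text) and the lazy one more text (even more text).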

"^[^(]*\\((.*)\\)$"

Might be more along the lines of what you're looking for, though. For this simple example it doesn't really matter much, but it could if you wanted to expand on the regex, for example by making the part inside the parentheses optional.

I recommend this (with the double escaping of the backslashes removed, since that is Java string syntax and not part of the regex):

^[^(]*\((.*)\)

Matching with your version ( ^.*\\((.*)\\)$ ) occurs like this:

  1. The star matches greedily, so your first .* goes right to the end of the string.
  2. Then it backtracks just as much as necessary so the \\( can match; that would be the last opening paren in the string.
  3. Then the next .* goes right to the end of the string again.
  4. Then it backtracks just far enough so the \\) can match, i.e. to the last closing paren.

When you use [^(]* instead of .* , it can't go past the first opening paren, so the first opening paren (the correct one) in the string will delimit your sub-match.
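The effect of the backtracking described above can be observed by asking the matcher where capture group 1 begins. A minimal Java sketch, with a hypothetical helper name:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; Matcher.start(int) is the real API used here.
class CaptureStart {
    // Returns the index at which capture group 1 starts, or -1 if no match.
    static int startOfGroup1(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.matches() ? m.start(1) : -1;
    }

    public static void main(String[] args) {
        String input = "text (more text (even more text))";
        // With .* the group starts right after the last '('.
        System.out.println(startOfGroup1("^.*\\((.*)\\)$", input));
        // With [^(]* the group starts right after the first '('.
        System.out.println(startOfGroup1("^[^(]*\\((.*)\\)$", input));
    }
}
```

For this input the first '(' is at index 5 and the second at index 16, so the two calls should report 17 and 6 respectively.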

Try this:

"^.*?\\((.*)\\)$"

True regular expressions can't count parentheses; this requires a pushdown automaton. Some regex libraries have extensions to support this, but I don't think Java's does (could be wrong; Java isn't my forte).

BTW, the other answers I've seen so far will work with the example given, but will break with, e.g., text (more text (even more text)) (another bit of text) . Changing greediness doesn't make up for the inability to count.
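One way around the counting problem is to skip the regex and scan the string while tracking nesting depth. The following is only a sketch under that assumption; the class and method names are invented for this example.

```java
// Hypothetical helper: a hand-written scan instead of a regex.
class BalancedParens {
    // Returns the text inside the first outermost pair of parentheses,
    // or null if the input contains no balanced pair.
    static String firstOutermost(String s) {
        int depth = 0;   // current nesting level
        int start = -1;  // index just after the outermost '('
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '(') {
                if (depth == 0) {
                    start = i + 1;
                }
                depth++;
            } else if (c == ')' && depth > 0) {
                depth--;
                if (depth == 0) {
                    return s.substring(start, i);
                }
            }
            // A ')' seen at depth 0 is unmatched and is ignored.
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(firstOutermost(
            "text (more text (even more text)) (another bit of text)"));
    }
}
```

On the problematic input above this should return more text (even more text), because the scan stops as soon as the depth returns to zero rather than running on to the last closing parenthesis.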

$str =~ /^.*?\((.*)\)/

I think the reason is that your second wildcard is picking up the closing parenthesis. You'll need to exclude it.

